WcodeW v2 · about

About WcodeW

A benchmark + viewer for measuring how faithfully an LLM can re-implement a real webpage from a static specification — without ever touching the live URL.

The loop

  1. Seal a real page. A capture pipeline (Playwright) saves the page's DOM, accessibility tree, every network response, and full-page screenshots at desktop + mobile viewports across three scroll positions (hero, scroll_50, scroll_end). Output goes to wclone/<category>/<id>/sealed/.
  2. Hand the agent only the spec. What the agent actually receives is in agent_input/: the manifest, the public interaction script, and a short brief — no live URL, no original assets. The agent's job is to write a single self-contained index.html.
  3. Score the clone. The pipeline renders the agent's HTML through the same capture rig and compares it to the sealed snapshot on four axes (see below). Outputs land in scoring-agent-<model>/score.json.
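The capture grid from step 1 is small enough to enumerate. The sketch below lists every (viewport, step) pair; the viewport and step names come from the text above, while the scroll fractions (0, 0.5, 1) are an assumption inferred from the step names, and `CaptureCell` and `captureCells` are hypothetical names rather than anything from the pipeline.

```typescript
// Viewport and step names come from the text; the scroll fractions are
// ASSUMED from the step names, not read from a real manifest.
type Viewport = "desktop" | "mobile";
type Step = "hero" | "scroll_50" | "scroll_end";

interface CaptureCell {
  viewport: Viewport;
  step: Step;
  fraction: number; // assumed scroll fraction in [0, 1]
}

const ASSUMED_FRACTIONS: Record<Step, number> = {
  hero: 0,
  scroll_50: 0.5,
  scroll_end: 1,
};

// Enumerate every (viewport, step) pair that gets a screenshot.
function captureCells(): CaptureCell[] {
  const viewports: Viewport[] = ["desktop", "mobile"];
  const steps: Step[] = ["hero", "scroll_50", "scroll_end"];
  const cells: CaptureCell[] = [];
  for (const viewport of viewports) {
    for (const step of steps) {
      cells.push({ viewport, step, fraction: ASSUMED_FRACTIONS[step] });
    }
  }
  return cells;
}
```

Two viewports times three steps gives six cells per page, which is the grid the visual score averages over.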

Scoring

The composite overall score is a weighted sum:

Dimension | Weight | What it measures
visual | 0.50 | Structural similarity (SSIM) between sealed and agent screenshots, averaged across viewports × steps.
dom | 0.30 | Jaccard similarity over a11y trees plus cosine similarity over tag-bag vectors.
interaction | 0.05 | Whether each manifest interaction (scrolls, clicks) executed without error and whether the post-interaction state visually matches.
aj | 0.15 | "Agentic judge": a separate LLM scores natural-language assertions about the page (layer 1 / 2 / 3 invariants).
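As a worked example of the weighted sum, here is a minimal sketch using the four weights listed above. It assumes each dimension score is already normalised to the range 0 to 1; `DimensionScores` and `overallScore` are hypothetical names, not the pipeline's own.

```typescript
// Weights are taken from the scoring table; names are illustrative only.
interface DimensionScores {
  visual: number;
  dom: number;
  interaction: number;
  aj: number;
}

const WEIGHTS: DimensionScores = {
  visual: 0.5,
  dom: 0.3,
  interaction: 0.05,
  aj: 0.15,
};

// Weighted sum of the four dimensions. Assumes every input is in [0, 1],
// so the result is also in [0, 1] because the weights sum to 1.
function overallScore(s: DimensionScores): number {
  return (
    WEIGHTS.visual * s.visual +
    WEIGHTS.dom * s.dom +
    WEIGHTS.interaction * s.interaction +
    WEIGHTS.aj * s.aj
  );
}
```

With visual 0.8, dom 0.6, interaction 1.0 and aj 0.4, the terms are 0.40 + 0.18 + 0.05 + 0.06, so the overall score is 0.69. A perfect interaction score moves the total by at most 0.05, which is why visual and dom dominate the ranking.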

The viewer here adds a fifth signal not in the composite: a live diff % from a per-pixel comparison (max-channel threshold 10 of 255). It's a lighter, more intuitive read on visual fidelity than SSIM, useful for spotting where the clone visibly drifts.
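One plausible reading of the diff % is sketched below: a pixel counts as different when the largest of its per-channel differences exceeds 10. Ignoring the alpha channel and using a strict "greater than" comparison are both assumptions, and `diffPercent` is a hypothetical name; the viewer's actual code may differ on either point.

```typescript
// Percentage of pixels whose largest R/G/B difference exceeds the threshold.
// Inputs are same-size RGBA byte buffers (the layout canvas ImageData uses).
// ASSUMPTIONS: alpha is ignored, and the comparison is strictly "greater than".
function diffPercent(
  a: Uint8ClampedArray,
  b: Uint8ClampedArray,
  threshold: number = 10,
): number {
  if (a.length !== b.length || a.length % 4 !== 0) {
    throw new Error("buffers must be same-size RGBA data");
  }
  const pixelCount = a.length / 4;
  if (pixelCount === 0) return 0;

  let differing = 0;
  for (let i = 0; i < a.length; i += 4) {
    const maxChannel = Math.max(
      Math.abs(a[i] - b[i]),         // red
      Math.abs(a[i + 1] - b[i + 1]), // green
      Math.abs(a[i + 2] - b[i + 2]), // blue
    );
    if (maxChannel > threshold) differing++;
  }
  return (differing / pixelCount) * 100;
}
```

Unlike SSIM, this has no notion of local structure: a one-pixel layout shift can flag a whole edge, which is what makes it useful for seeing where the drift is.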

The four view modes

iframe
Both the rehydrated reference and the agent's clone render in real iframes. Step selector scrolls each iframe to the manifest fraction so you can compare layout at any scroll depth.
screenshot
Static PNG pairs from the sealed capture. Fastest mode and the most pixel-faithful: same-resolution, same-viewport, no font loading flicker.
diff
Screenshot mode with a red/amber mask painted over the agent side wherever it disagrees with the sealed pixels.
code
The actual HTML source side-by-side. Closes the web → code → web loop literally — see what the agent wrote vs. what the canonical DOM contains.
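For the iframe mode, scrolling "to the manifest fraction" could work roughly as follows. This is a sketch under the assumption that a fraction of 0 means the top of the page and 1 means the furthest scrollable position; `scrollOffsetForFraction` is a hypothetical helper, not the viewer's actual function.

```typescript
// Convert a scroll fraction into a pixel offset for a document of the given
// total height shown in a viewport of the given height.
// ASSUMPTION: fraction 0 = top of page, 1 = maximum scroll position.
function scrollOffsetForFraction(
  fraction: number,
  scrollHeight: number,
  viewportHeight: number,
): number {
  // A page shorter than the viewport cannot scroll at all.
  const maxScroll = Math.max(0, scrollHeight - viewportHeight);
  // Clamp out-of-range fractions instead of scrolling past the ends.
  const clamped = Math.min(1, Math.max(0, fraction));
  return Math.round(clamped * maxScroll);
}
```

In a same-origin iframe the result could be passed to `iframe.contentWindow.scrollTo(0, offset)`. Because the reference and the clone rarely have the same total height, the same fraction generally maps to a different pixel offset on each side.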

The asset compression metric

sealed/network/index.json records every byte the original page pulled (typically 20-50 files / 200-500 KB). The agent's clone is a single inline index.html, usually 5-15 KB. The compare panel shows both totals side by side — a reminder that "clone" here means "static visual replica," not "drop-in functional replacement."
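The two totals reduce to a single ratio. The sketch below assumes each entry in the network index carries a byte count in a field called `bytes`; that field name, the `NetworkEntry` shape, and the direction of the ratio (original over clone) are all assumptions for illustration, not the documented schema.

```typescript
// HYPOTHETICAL shape of one network index entry; the real schema may differ.
interface NetworkEntry {
  bytes: number;
}

// Sum the bytes of every response the original page pulled.
function totalBytes(entries: NetworkEntry[]): number {
  return entries.reduce((sum, entry) => sum + entry.bytes, 0);
}

// How many times smaller the clone is than the original page's payload.
// ASSUMPTION: the ratio is expressed as original / clone.
function compressionRatio(sealedBytes: number, cloneBytes: number): number {
  if (cloneBytes <= 0) {
    throw new Error("clone size must be positive");
  }
  return sealedBytes / cloneBytes;
}
```

For example, a page that pulled 300 KB across its responses and a 10 KB clone give a ratio of 30.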

Limitations

Adding a new bundle or agent run

See the annotator playbook for the bundle-creation recipe. To add another agent's output for an existing bundle, drop the generated HTML at wclone/<cat>/<id>/submission-agent-<safe-model>/index.html, run pipeline/bin/wclone-score.ts against it, then regenerate the seeds index with pipeline/bin/wclone-build-index.ts.
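The drop location can be built from the template above. In this sketch only the path template comes from the text; the sanitising rule inside `safeModelName` is a guess at what "safe-model" means (lowercase, with runs of other characters collapsed to a hyphen), and both function names are hypothetical.

```typescript
// HYPOTHETICAL sanitiser: the real rule for <safe-model> is not documented
// here. This one lowercases and replaces runs of characters outside
// [a-z0-9._-] with a single hyphen, then trims hyphens at both ends.
function safeModelName(model: string): string {
  return model
    .toLowerCase()
    .replace(/[^a-z0-9._-]+/g, "-")
    .replace(/^-+|-+$/g, "");
}

// Build the submission path from the template in the text.
function submissionPath(category: string, id: string, model: string): string {
  return `wclone/${category}/${id}/submission-agent-${safeModelName(model)}/index.html`;
}
```

Check an existing submission directory for the exact naming convention before relying on this; a mismatch would leave the run out of the regenerated index.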

Keyboard shortcuts

Key | Page | What it does
← / → | compare | Nudge the slider 1 % (Shift = 5 %)
Home / End | compare | Snap the slider to 0 % / 100 %
j / k | compare, matrix | Cycle to next / previous bundle
? | compare | Toggle the keyboard shortcut cheat-sheet toast
/ | gallery | Focus the search box
Esc | v2 viewer | Close the compare modal
d | any page | Toggle dark / light theme
1-4 | compare | Switch render mode (iframe / screenshot / diff / code)
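The slider shortcuts amount to a small pure function. This sketch assumes the nudge keys are the left and right arrows and that the value is clamped to 0 to 100; the step sizes and the Home / End behaviour come from the table, and `nextSliderPercent` is a hypothetical name.

```typescript
// Compute the new slider position (in percent) after a key press.
// `key` uses standard KeyboardEvent.key values.
// ASSUMPTIONS: the arrow keys are the nudge keys; the result is clamped.
function nextSliderPercent(current: number, key: string, shift: boolean): number {
  const step = shift ? 5 : 1; // Shift widens the nudge from 1 % to 5 %
  let next: number;
  switch (key) {
    case "ArrowLeft":
      next = current - step;
      break;
    case "ArrowRight":
      next = current + step;
      break;
    case "Home":
      next = 0;
      break;
    case "End":
      next = 100;
      break;
    default:
      return current; // not a slider key: leave the position alone
  }
  return Math.min(100, Math.max(0, next));
}
```

Keeping this separate from the event listener makes the behaviour easy to check without a browser; the listener would only need to pass along `event.key` and `event.shiftKey`.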

Data download

For analysis in a spreadsheet or pandas, the entire (bundle × agent_run) metric set is available as a flat CSV at wclone-export.csv — one row per agent run, columns for every score, per-cell diff%, byte / element compression ratios, and prompt / response sizes. The raw JSON indexes are also linkable: wclone-seeds.json, wclone-diff-index.json, wclone-dom-index.json.
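For a quick look without a spreadsheet, the flat CSV can be grouped in a few lines. The sketch below is deliberately minimal: it assumes no quoted fields or embedded commas, and the column names `model` and `overall` used in the usage note are hypothetical, so check the real header row in wclone-export.csv first.

```typescript
// Parse a simple CSV (no quoted fields, no embedded commas) into row objects
// keyed by the header names.
function parseSimpleCsv(text: string): Record<string, string>[] {
  const lines = text.trim().split(/\r?\n/);
  const header = lines[0].split(",");
  return lines.slice(1).map((line) => {
    const cells = line.split(",");
    const row: Record<string, string> = {};
    header.forEach((name, i) => {
      row[name] = cells[i] ?? "";
    });
    return row;
  });
}

// Mean of a numeric column, grouped by another column.
// Rows whose value does not parse as a finite number are skipped.
function meanBy(
  rows: Record<string, string>[],
  groupCol: string,
  valueCol: string,
): Record<string, number> {
  const sums: Record<string, { total: number; count: number }> = {};
  for (const row of rows) {
    const value = Number(row[valueCol]);
    if (row[valueCol] === "" || !Number.isFinite(value)) continue;
    const key = row[groupCol];
    if (!sums[key]) sums[key] = { total: 0, count: 0 };
    sums[key].total += value;
    sums[key].count += 1;
  }
  const means: Record<string, number> = {};
  for (const key of Object.keys(sums)) {
    means[key] = sums[key].total / sums[key].count;
  }
  return means;
}
```

Given rows parsed from the export, `meanBy(rows, "model", "overall")` would return the average composite score per model, provided those two columns exist under those names.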

Source

MIT-licensed at github.com/reacher-z/WcodeW. The viewer is plain HTML + ES modules + a single shared stylesheet, deliberately dep-free so it works offline and on GitHub Pages without a build step.