About WcodeW

A benchmark + viewer for measuring how faithfully an LLM can re-implement a real webpage from a static specification — without ever touching the live URL.

The loop

Seal a real page. A capture pipeline (Playwright) saves the page's DOM, accessibility tree, every network response, and full-page screenshots at desktop + mobile viewports across three scroll positions (hero, scroll_50, scroll_end). Output goes to wclone/<category>/<id>/sealed/.
Hand the agent only the spec. What the agent actually receives is in agent_input/: the manifest, the public interaction script, and a short brief — no live URL, no original assets. The agent's job is to write a single self-contained index.html.
Score the clone. The pipeline renders the agent's HTML through the same capture rig and compares it to the sealed snapshot on four axes (see below). Outputs land in scoring-agent-<experiment-tag>/score.json.

Scoring

The composite overall score is a weighted sum:

Dimension	Weight	What it measures
`visual`	0.50	Pixelmatch similarity between sealed + agent screenshots, averaged across viewports × steps.
`dom`	0.30	Jaccard similarity over a11y trees + cosine similarity over tag-bag vectors.
`interaction`	0.05	Whether each current hero/scroll checkpoint executed without error, plus how closely the post-state visuals match.
`aj`	0.15	"Agentic judge" — a separate LLM scores natural-language assertions about the page (layer 1 / 2 / 3 invariants).
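
As a minimal sketch of the weighted sum in the table above: the weights are copied from the table, while the function name, the type names, and the assumption that every per-dimension score is already normalised to the 0 to 1 range are illustrative and not taken from the pipeline source.

```typescript
// Weights copied from the scoring table; they sum to 1.00.
const WEIGHTS = { visual: 0.5, dom: 0.3, interaction: 0.05, aj: 0.15 } as const;

type Dimension = keyof typeof WEIGHTS;
type Scores = Record<Dimension, number>;

// Weighted sum of the four per-dimension scores.
// Assumption: each input score is already in [0, 1].
function compositeScore(scores: Scores): number {
  let total = 0;
  for (const dim of Object.keys(WEIGHTS) as Dimension[]) {
    total += WEIGHTS[dim] * scores[dim];
  }
  return total;
}

// Hypothetical run: strong visuals, weaker DOM match.
const exampleScores: Scores = { visual: 0.9, dom: 0.6, interaction: 1.0, aj: 0.8 };
console.log(compositeScore(exampleScores).toFixed(3)); // prints "0.800"
```

Because visual carries half the weight, a clone can score well overall while still missing most interaction checkpoints.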

The viewer here adds a fifth signal not in the composite: a live diff % from a per-pixel comparison (max-channel threshold 10 of 255). It's a lighter, more intuitive read on visual fidelity than the aggregate visual score, useful for spotting where the clone visibly drifts.
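
The per-pixel comparison described above could look roughly like the following sketch. It assumes both images are RGBA byte buffers of identical dimensions, that alpha is ignored, and that a pixel counts as different when its largest channel delta is strictly greater than the threshold; the viewer's actual code may differ on any of these points, and `diffPercent` is a hypothetical name.

```typescript
// Percentage (0-100) of pixels whose largest per-channel difference
// exceeds the threshold. Inputs are RGBA buffers, 4 bytes per pixel.
// Assumptions: alpha is ignored; the comparison is strictly greater-than.
function diffPercent(
  sealed: Uint8ClampedArray,
  agent: Uint8ClampedArray,
  threshold: number = 10,
): number {
  if (sealed.length !== agent.length || sealed.length % 4 !== 0) {
    throw new Error("expected two RGBA buffers of equal length");
  }
  const pixelCount = sealed.length / 4;
  if (pixelCount === 0) return 0;

  let differing = 0;
  for (let i = 0; i < sealed.length; i += 4) {
    const maxDelta = Math.max(
      Math.abs(sealed[i] - agent[i]),         // red
      Math.abs(sealed[i + 1] - agent[i + 1]), // green
      Math.abs(sealed[i + 2] - agent[i + 2]), // blue
    );
    if (maxDelta > threshold) differing++;
  }
  return (differing / pixelCount) * 100;
}
```

In a browser, buffers of this shape can be read from a canvas with `getImageData(...).data`.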

The four view modes

iframe: Both the rehydrated reference and the agent's clone render in real iframes. Step selector scrolls each iframe to the manifest fraction so you can compare layout at any scroll depth.
screenshot: Static PNG pairs from the sealed capture. Fastest mode and the most pixel-faithful: same-resolution, same-viewport, no font loading flicker.
diff: Screenshot mode with a red/amber mask painted over the agent side wherever it disagrees with the sealed pixels.
code: The actual HTML source side-by-side. Closes the web → code → web loop literally — see what the agent wrote vs. what the canonical DOM contains.
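
For the iframe mode's step selector, one plausible way to turn a manifest fraction into a scroll offset is sketched below. The function name and the interpretation of the fraction (0 is the top of the page, 1 is the furthest scrollable position) are assumptions, not something the manifest format is known to guarantee.

```typescript
// Map a scroll fraction in [0, 1] to a pixel offset for a document.
// Assumption: 0 means the top and 1 means the furthest scrollable position.
function scrollTopForFraction(
  fraction: number,
  scrollHeight: number,
  viewportHeight: number,
): number {
  // Pages shorter than the viewport cannot scroll at all.
  const maxScroll = Math.max(0, scrollHeight - viewportHeight);
  const clamped = Math.min(1, Math.max(0, fraction));
  return Math.round(clamped * maxScroll);
}

// Hypothetical 3000 px page in an 800 px viewport, scrolled halfway.
console.log(scrollTopForFraction(0.5, 3000, 800)); // prints 1100
```

Applying the same offset to both iframes keeps the reference and the clone aligned only when their total heights match; if the clone is shorter, the same fraction lands on a different pixel offset, which is itself a useful layout signal.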

The asset compression metric

sealed/network/index.json records every byte the original page pulled (typically 20-50 files / 200-500 KB). The agent's clone is a single inline index.html, usually 5-15 KB. The compare panel shows both totals side by side — a reminder that "clone" here means "static visual replica," not "drop-in functional replacement."
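
The two totals can be reduced to a single ratio, as in the sketch below. The `NetworkEntry` shape is hypothetical: the real schema of sealed/network/index.json is not shown here, so the `url` and `bytes` field names are placeholders.

```typescript
// Hypothetical shape of one record in sealed/network/index.json.
// The real schema may use different field names.
interface NetworkEntry {
  url: string;
  bytes: number;
}

// Ratio of bytes fetched by the original page to bytes in the clone.
function compressionRatio(entries: NetworkEntry[], cloneBytes: number): number {
  if (cloneBytes <= 0) {
    throw new Error("cloneBytes must be positive");
  }
  const sealedBytes = entries.reduce((sum, entry) => sum + entry.bytes, 0);
  return sealedBytes / cloneBytes;
}

// Made-up numbers inside the typical ranges quoted above.
const sampleEntries: NetworkEntry[] = [
  { url: "https://example.com/", bytes: 200_000 },
  { url: "https://example.com/app.css", bytes: 100_000 },
];
console.log(compressionRatio(sampleEntries, 10_000)); // prints 30
```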

Limitations

Sealed assets are referenced by SHA-256 hash, so the rehydrated reference iframe needs them available locally. This works on GitHub Pages because the deploy mirror copies sealed/assets/.
JavaScript is disabled in both iframes (sandbox="allow-same-origin", with no allow-scripts token), so you'll see CSS + DOM but no JS-driven motion.
diff % counts pixels that exceed a max-channel threshold; it is not a perceptual metric. A 1-pixel offset in body text can balloon the ratio without meaning much. The aggregate visual score is the more honest metric for reasoning about quality.
Agent mobile screenshots may be captured at 2× DPR while sealed ones are at 1×; the viewer scales both to the smaller dimensions before comparing.
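
The DPR mismatch in the last point comes down to choosing a common size before diffing. A minimal sketch follows, assuming "smaller dimensions" means the smaller width and the smaller height taken independently; the `Size` type, the function name, and the example viewport are all illustrative.

```typescript
interface Size {
  width: number;
  height: number;
}

// Pick the size both screenshots are scaled to before comparison.
// Assumption: the minimum is taken per axis.
function comparisonSize(sealed: Size, agent: Size): Size {
  return {
    width: Math.min(sealed.width, agent.width),
    height: Math.min(sealed.height, agent.height),
  };
}

// Hypothetical mobile viewport: sealed at 1x DPR, agent at 2x DPR.
const target = comparisonSize(
  { width: 390, height: 844 },
  { width: 780, height: 1688 },
);
console.log(`${target.width}x${target.height}`); // prints "390x844"
```

Downscaling the 2× image blends neighbouring pixels, so some sub-threshold noise in diff % is expected even when the layouts agree.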

Adding a new bundle or agent run

See the annotator playbook for the bundle-creation recipe. To add another agent's output for an existing bundle, drop the generated HTML at wclone/<cat>/<id>/submission-agent-<experiment-tag>/index.html, run pipeline/bin/wclone-score.ts against it, then regenerate the seeds index with pipeline/bin/wclone-build-index.ts --source wclone.

Keyboard shortcuts

Key	Page	What it does
`←` / `→`	compare	Nudge the slider 1 % (Shift = 5 %)
`Home` / `End`	compare	Snap the slider to 0 % / 100 %
`j` / `k`	compare, matrix	Cycle to next / previous bundle
`?`	compare	Toggle the keyboard shortcut cheat-sheet toast
`/`	gallery	Focus the search box
`Esc`	v2 viewer	Close the compare modal
`d`	any page	Toggle dark / light theme
`1`–`4`	compare	Switch render mode (iframe / screenshot / diff / code)
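
The slider rows in the table above amount to a small pure function. The sketch below uses the standard `KeyboardEvent.key` names (`ArrowLeft`, `ArrowRight`, `Home`, `End`); the function name and the clamping to the 0 to 100 range are assumptions about how the viewer behaves at the edges.

```typescript
// Compute the next slider position (in percent) for a key press.
// Assumption: the position is clamped to [0, 100].
function nextSliderPercent(current: number, key: string, shift: boolean): number {
  const step = shift ? 5 : 1; // Shift widens the nudge from 1 % to 5 %
  switch (key) {
    case "ArrowLeft":
      return Math.max(0, current - step);
    case "ArrowRight":
      return Math.min(100, current + step);
    case "Home":
      return 0;
    case "End":
      return 100;
    default:
      return current; // other keys leave the slider alone
  }
}
```

In a `keydown` handler this would be called as `nextSliderPercent(value, event.key, event.shiftKey)`.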

Data download

For analysis in a spreadsheet or pandas, the entire (bundle × agent_run) metric set is available as a flat CSV at wclone-export.csv, with one row per agent run and columns for every score, per-cell diff %, byte / element compression ratios, and prompt / response sizes. The raw JSON indexes can also be linked to directly: wclone-seeds.json, wclone-diff-index.json, wclone-dom-index.json.

Source

MIT-licensed at github.com/reacher-z/WcodeW. The viewer is plain HTML + ES modules + a single shared stylesheet, deliberately dep-free so it works offline and on GitHub Pages without a build step.