About WcodeW
A benchmark + viewer for measuring how faithfully an LLM can re-implement a real webpage from a static specification — without ever touching the live URL.
The loop
- Seal a real page. A capture pipeline (Playwright) saves the page's DOM, accessibility tree, every network response, and full-page screenshots at desktop + mobile viewports across three scroll positions (hero, scroll_50, scroll_end). Output goes to wclone/<category>/<id>/sealed/.
- Hand the agent only the spec. What the agent actually receives is in agent_input/: the manifest, the public interaction script, and a short brief. It gets no live URL and no original assets. The agent's job is to write a single self-contained index.html.
- Score the clone. The pipeline renders the agent's HTML through the same capture rig and compares it to the sealed snapshot on four axes (see below). Outputs land in scoring-agent-<model>/score.json.
Scoring
The composite overall score is a weighted sum:
| Dimension | Weight | What it measures |
|---|---|---|
| visual | 0.50 | Structural similarity (SSIM) between sealed and agent screenshots, averaged across viewports × steps. |
| dom | 0.30 | Jaccard similarity over a11y trees + cosine similarity over tag-bag vectors. |
| interaction | 0.05 | Whether each manifest interaction (scrolls, clicks) executed without error and whether the post-interaction visuals match. |
| aj | 0.15 | "Agentic judge": a separate LLM scores natural-language assertions about the page (layer 1 / 2 / 3 invariants). |
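The weights in the table translate directly into a weighted sum, and the two ingredients of the dom dimension are standard set and vector similarities. The sketch below is illustrative only: the weights come from the table, but the function names, the assumption that every dimension is already normalised to [0, 1], and the way the scorer actually blends Jaccard with cosine are not taken from the pipeline source.

```typescript
type Dimension = "visual" | "dom" | "interaction" | "aj";

// Weights copied from the scoring table; everything else here is an assumption.
const WEIGHTS: Record<Dimension, number> = {
  visual: 0.5,
  dom: 0.3,
  interaction: 0.05,
  aj: 0.15,
};

// Weighted sum of per-dimension scores, each assumed to lie in [0, 1].
function compositeScore(scores: Record<Dimension, number>): number {
  let total = 0;
  (Object.keys(WEIGHTS) as Dimension[]).forEach((dim) => {
    total += WEIGHTS[dim] * scores[dim];
  });
  return total;
}

// Jaccard similarity |A ∩ B| / |A ∪ B| over two sets of labels
// (for example, role/name pairs pulled from an a11y tree).
function jaccard(a: Set<string>, b: Set<string>): number {
  if (a.size === 0 && b.size === 0) return 1; // two empty sets count as identical
  let shared = 0;
  a.forEach((item) => {
    if (b.has(item)) shared++;
  });
  return shared / (a.size + b.size - shared);
}

// Cosine similarity between two tag-count maps ("tag bags").
function cosine(a: Map<string, number>, b: Map<string, number>): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  a.forEach((count, tag) => {
    dot += count * (b.get(tag) || 0);
    normA += count * count;
  });
  b.forEach((count) => {
    normB += count * count;
  });
  if (normA === 0 || normB === 0) return 0; // an empty bag has no direction
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

Because visual carries half the weight, a clone with a perfect DOM but poor screenshots still scores low overall.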
The viewer here adds a fifth signal not in the composite: a
live diff % from a per-pixel comparison
(max-channel threshold 10 of 255). It's a lighter, more
intuitive read on visual fidelity than SSIM, useful for
spotting where the clone visibly drifts.
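A per-pixel max-channel comparison of this kind can be sketched as follows. This is a guess at the general shape, not the viewer's code: the function name is hypothetical, the buffers are assumed to be same-sized RGBA arrays (as returned by a canvas `getImageData` call), alpha is ignored, and "threshold 10" is read as "strictly greater than 10".

```typescript
// Percentage of pixels whose largest per-channel difference exceeds the threshold.
// Both buffers are assumed to hold RGBA bytes for images of identical dimensions.
function diffPercent(
  sealed: Uint8ClampedArray,
  agent: Uint8ClampedArray,
  threshold: number = 10,
): number {
  if (sealed.length !== agent.length || sealed.length % 4 !== 0) {
    throw new Error("buffers must be same-sized RGBA arrays");
  }
  const pixelCount = sealed.length / 4;
  if (pixelCount === 0) return 0;

  let differing = 0;
  for (let i = 0; i < sealed.length; i += 4) {
    // Largest absolute difference across R, G, and B (alpha at i + 3 is skipped).
    const maxDelta = Math.max(
      Math.abs(sealed[i] - agent[i]),
      Math.abs(sealed[i + 1] - agent[i + 1]),
      Math.abs(sealed[i + 2] - agent[i + 2]),
    );
    if (maxDelta > threshold) differing++;
  }
  return (differing / pixelCount) * 100;
}
```

The same per-pixel flags could be used to paint a mask over the differing pixels.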
The four view modes
- iframe: Both the rehydrated reference and the agent's clone render in real iframes. The step selector scrolls each iframe to the manifest fraction so you can compare layout at any scroll depth.
- screenshot: Static PNG pairs from the sealed capture. Fastest mode and the most pixel-faithful: same resolution, same viewport, no font-loading flicker.
- diff: Screenshot mode with a red/amber mask painted over the agent side wherever it disagrees with the sealed pixels.
- code: The actual HTML source side by side. Closes the web → code → web loop literally: see what the agent wrote vs. what the canonical DOM contains.
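Scrolling an iframe "to the manifest fraction" amounts to converting a fraction into a pixel offset. The sketch below assumes a mapping inferred only from the step names (hero = 0, scroll_50 = 0.5, scroll_end = 1); the real manifest format is not shown here, and the function name is hypothetical.

```typescript
// Assumed fractions for the three capture steps; inferred from the names,
// not read from an actual manifest.
const STEP_FRACTIONS: Record<string, number> = {
  hero: 0,
  scroll_50: 0.5,
  scroll_end: 1,
};

// Convert a step name into a scrollTop value for a document of the given
// total height, viewed through a viewport of the given height.
function scrollTopForStep(
  step: string,
  scrollHeight: number,
  viewportHeight: number,
): number {
  const fraction = STEP_FRACTIONS[step];
  if (fraction === undefined) {
    throw new Error("unknown step: " + step);
  }
  // A page shorter than the viewport cannot scroll at all.
  const maxScroll = Math.max(0, scrollHeight - viewportHeight);
  return Math.round(fraction * maxScroll);
}
```

In a browser, the result would typically be applied to each iframe with something like `frame.contentWindow.scrollTo(0, top)`. If the two documents differ in height, the same fraction lands on different pixel offsets, which is why a fraction rather than an absolute offset is the natural unit here.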
The asset compression metric
sealed/network/index.json records every byte the
original page pulled (typically 20-50 files / 200-500 KB). The
agent's clone is a single inline index.html,
usually 5-15 KB. The compare panel shows both totals side by
side — a reminder that "clone" here means "static visual
replica," not "drop-in functional replacement."
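The two totals reduce to a sum and a ratio. In the sketch below the entry shape (`url`, `bytes`) is hypothetical, since the schema of sealed/network/index.json is not reproduced here, and the function names are illustrative.

```typescript
// Hypothetical shape of one record in sealed/network/index.json.
interface NetworkEntry {
  url: string;
  bytes: number;
}

// Total bytes the original page pulled over the network.
function totalBytes(entries: NetworkEntry[]): number {
  return entries.reduce((sum, entry) => sum + entry.bytes, 0);
}

// How many times larger the original payload is than the agent's single file.
function compressionRatio(entries: NetworkEntry[], cloneBytes: number): number {
  if (cloneBytes <= 0) {
    throw new Error("cloneBytes must be positive");
  }
  return totalBytes(entries) / cloneBytes;
}
```

With the typical figures quoted above, an original of a few hundred KB against a clone of roughly 10 KB gives a ratio in the tens.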
Limitations
- Sealed assets are referenced via SHA-256 hashes; the rehydrated reference iframe needs them locally. This works on GitHub Pages because the deploy mirror copies sealed/assets/.
- JavaScript inside both iframes is sandboxed off (sandbox="allow-same-origin"), so you'll see CSS + DOM but no JS-driven motion.
- diff % is a per-pixel max-channel threshold, not perceptual. A 1-pixel offset in body text can balloon the ratio without meaning much. SSIM is the more honest metric for reasoning about quality.
- Mobile screenshots may be captured at 2× DPR while sealed ones are at 1×; the viewer scales both to the smaller dimensions before comparing.
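The DPR mismatch in the last item comes down to picking a common target size before any pixel comparison. A minimal sketch, assuming "the smaller dimensions" means the smaller width and the smaller height taken independently; the type and function names are hypothetical.

```typescript
interface Size {
  width: number;
  height: number;
}

// Target size for comparison: the smaller width and the smaller height of the pair.
function commonSize(a: Size, b: Size): Size {
  return {
    width: Math.min(a.width, b.width),
    height: Math.min(a.height, b.height),
  };
}

// Per-axis factors needed to scale an image of size `from` to size `to`.
function scaleFactors(from: Size, to: Size): { x: number; y: number } {
  return { x: to.width / from.width, y: to.height / from.height };
}
```

For a 1× capture at 390 × 844 and a 2× capture at 780 × 1688, the common size is 390 × 844 and the larger image is scaled by 0.5 on both axes. Downscaling blends neighbouring pixels, which is another reason small diff % values should be read loosely on mobile captures.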
Adding a new bundle or agent run
See the
annotator playbook for the bundle-creation recipe. To add
another agent's output for an existing bundle, drop the
generated HTML at
wclone/<cat>/<id>/submission-agent-<safe-model>/index.html,
run pipeline/bin/wclone-score.ts against it, then
regenerate the seeds index with
pipeline/bin/wclone-build-index.ts.
Keyboard shortcuts
| Key | Page | What it does |
|---|---|---|
| ← / → | compare | Nudge the slider 1 % (Shift = 5 %) |
| Home / End | compare | Snap the slider to 0 % / 100 % |
| j / k | compare, matrix | Cycle to next / previous bundle |
| ? | compare | Toggle the keyboard shortcut cheat-sheet toast |
| / | gallery | Focus the search box |
| Esc | v2 viewer | Close the compare modal |
| d | any page | Toggle dark / light theme |
| 1–4 | compare | Switch render mode (iframe / screenshot / diff / code) |
Data download
For analysis in a spreadsheet or pandas, the entire (bundle × agent_run) metric set is available as a flat CSV at wclone-export.csv — one row per agent run, columns for every score, per-cell diff%, byte / element compression ratios, and prompt / response sizes. The raw JSON indexes are also linkable: wclone-seeds.json, wclone-diff-index.json, wclone-dom-index.json.
Source
MIT-licensed at github.com/reacher-z/WcodeW. The viewer is plain HTML + ES modules + a single shared stylesheet, deliberately dep-free so it works offline and on GitHub Pages without a build step.