Main results

Single-page summary, screenshot-ready for the paper's main figure. All numbers exclude the 12 bot-challenged bundles (8.2% of the 146-bundle scored set), leaving a 134-bundle effective evaluation set.


Table 1. Overall score by condition

#	Condition	Interface	N	overall	visual	dom	aj (indep)	overall ÷ ref

Figure 1. Bar chart of overall scores

Notes

Practical ceiling. The reference rehydrate scores 0.71, not 1.0, because of sub-pixel font drift in Chromium between seal time and eval time. Model scores should therefore be read relative to 0.71 (the overall ÷ ref column) rather than to 1.0.
Independent judge. Each LLM agent's aj (indep) column is scored by the *other* frontier judge: gemini-2.5-pro judges non-gemini agents, and opus-4.7 judges gemini agents. Same-family judging inflates the mean aj by about 0.33.
Two of three single-shot LLMs lose to the dom-copy baseline (0.463). Only the agentic interface clears the trivial-baseline barrier.
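
The conventions in the notes above can be sketched in a few lines of Python. This is a minimal illustration, not the project's evaluation code: the function names are hypothetical, and matching an agent to its family by the `gemini` name prefix is an assumption. Only the constants 0.71 and 0.463 and the two judge names come from this page.

```python
REFERENCE_CEILING = 0.71    # overall score of the reference rehydrate (from the notes)
DOM_COPY_BASELINE = 0.463   # overall score of the dom-copy baseline (from the notes)


def relative_to_reference(overall: float, ref: float = REFERENCE_CEILING) -> float:
    """Return overall / ref, i.e. the score as a fraction of the practical ceiling."""
    if ref <= 0:
        raise ValueError("reference score must be positive")
    return overall / ref


def independent_judge(agent_model: str) -> str:
    """Pick the judge from the other model family.

    Assumption: an agent belongs to the gemini family if its name
    starts with "gemini"; every other agent is judged by gemini-2.5-pro.
    """
    if agent_model.lower().startswith("gemini"):
        return "opus-4.7"
    return "gemini-2.5-pro"


def clears_baseline(overall: float, baseline: float = DOM_COPY_BASELINE) -> bool:
    """True if a condition's overall score strictly exceeds the dom-copy baseline."""
    return overall > baseline


if __name__ == "__main__":
    # The dom-copy baseline expressed as a fraction of the practical ceiling.
    print(round(relative_to_reference(DOM_COPY_BASELINE), 3))
    print(independent_judge("gemini-2.5-pro"))
```

Under these assumptions, the dom-copy baseline of 0.463 sits at roughly 0.65 of the 0.71 ceiling, which is the bar a condition has to exceed in the overall ÷ ref column to clear the trivial baseline.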

Data: wclone-export.csv (per-row scores) · wclone-seeds.json (manifest) · source: reacher-z/WcodeW.