Main results
Single-page summary, screenshot-ready for the paper's main figure. All numbers exclude the 12 bot-challenged bundles (8.2 % of the 146-bundle scored set), leaving a 134-bundle effective evaluation set.
Table 1. Overall score by condition
| # | Condition | Interface | N | overall | visual | dom | aj (indep) | overall ÷ ref |
|---|---|---|---|---|---|---|---|---|
Figure 1. Bar chart of overall scores
Notes
- Practical ceiling. The reference rehydrate scores 0.71, not 1.0, because of sub-pixel font drift in Chromium between seal time and eval time. Model scores should therefore be read relative to 0.71 (the overall ÷ ref column).
- Independent judge. Each LLM agent's aj (indep) score comes from the *other* frontier judge: gemini-2.5-pro judges non-gemini agents, and opus-4.7 judges gemini agents. Same-family judging inflates mean aj by ≈ 0.33.
- Two of three single-shot LLMs lose to the dom-copy baseline (0.463). Only the agentic interface clears the trivial-baseline barrier.
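The notes above can be sketched in a few lines of Python. This is a minimal illustration, not the project's scoring code: the function names are hypothetical, and detecting the model family from a name prefix is an assumption made only for the example. The two constants are the values quoted in the notes.

```python
REFERENCE_CEILING = 0.71   # reference rehydrate overall score (from the notes)
DOM_COPY_BASELINE = 0.463  # dom-copy baseline overall score (from the notes)


def relative_to_ceiling(overall: float, ceiling: float = REFERENCE_CEILING) -> float:
    """Express an overall score as a fraction of the practical ceiling."""
    return overall / ceiling


def independent_judge(agent_model: str) -> str:
    """Pick the cross-family judge for an agent.

    Assumption: the family is inferred from the model-name prefix.
    """
    if agent_model.lower().startswith("gemini"):
        return "opus-4.7"
    return "gemini-2.5-pro"


def clears_baseline(overall: float, baseline: float = DOM_COPY_BASELINE) -> bool:
    """True when a condition scores strictly above the dom-copy baseline."""
    return overall > baseline


# The dom-copy baseline itself, read against the 0.71 ceiling.
baseline_ratio = round(relative_to_ceiling(DOM_COPY_BASELINE), 3)
print(baseline_ratio)  # 0.652
```

Read this way, the dom-copy baseline already reaches roughly 65 % of the practical ceiling, which is why losing to it is a meaningful failure rather than a near miss.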
Data: wclone-export.csv (per-row scores) · wclone-seeds.json (manifest) · source: reacher-z/WcodeW.