Leaderboard

Each row is one submission family (an LLM agent or a baseline), scored across all bundles for which it has data. Scores are reported as mean ± 95% CI from a 1000-iteration bootstrap. Click a row to expand the per-category breakdown.
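As a rough illustration of how a mean ± 95% CI could be derived from a 1000-iteration bootstrap, here is a minimal sketch. It assumes a percentile bootstrap that resamples per-bundle scores with replacement; the page does not state which bootstrap variant or resampling unit is actually used, and the function name `bootstrap_mean_ci` and its parameters are hypothetical.

```python
import random
import statistics


def bootstrap_mean_ci(scores, n_iter=1000, alpha=0.05, seed=0):
    """Return (mean, lower, upper) for a list of per-bundle scores.

    Assumption: a percentile bootstrap over bundles; the leaderboard's
    real procedure may differ.
    """
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(scores)
    # Resample n scores with replacement n_iter times, recording each mean.
    means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_iter)
    )
    # Take the empirical alpha/2 and 1 - alpha/2 quantiles of those means.
    lower = means[round((alpha / 2) * n_iter)]
    upper = means[round((1 - alpha / 2) * n_iter) - 1]
    return statistics.fmean(scores), lower, upper


if __name__ == "__main__":
    mean, lower, upper = bootstrap_mean_ci([0.2, 0.5, 0.8, 0.6, 0.4])
    print(f"{mean:.3f} [{lower:.3f}, {upper:.3f}]")
```

With few bundles the interval will be wide, which is worth keeping in mind when comparing rows whose N differs.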

#	Submission	N	overall ▼	overall (indep judge)	overall ÷ ref	visual	dom	interaction	aj (opus)	aj (indep)

Per-category overall (mean)

Cells are colour-coded by overall score: red < 0.4, amber 0.4 to < 0.6, green ≥ 0.6.
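The colour bands can be expressed as a small lookup. This is a sketch only: the function name `score_colour` is hypothetical, and it assumes a score of exactly 0.4 falls in the amber band, with 0.6 counted as green per the "≥ 0.6" rule.

```python
def score_colour(score):
    """Map an overall score to its colour band (assumed boundary handling)."""
    if score < 0.4:
        return "red"
    if score < 0.6:
        return "amber"  # 0.4 up to, but not including, 0.6
    return "green"      # 0.6 and above


if __name__ == "__main__":
    for s in (0.39, 0.4, 0.59, 0.6):
        print(s, score_colour(s))
```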

Pairwise differences (overall)