Leaderboard
Each row is one submission family (an LLM agent or a baseline), scored across
all bundles for which it has data. Scores are the mean ± 95% CI from a
1000-iteration bootstrap.
| # | Submission | N | overall ▼ | overall (indep judge) | overall ÷ ref | visual | dom | interaction | aj (opus) | aj (indep) |
|---|---|---|---|---|---|---|---|---|---|---|
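The mean ± 95% CI columns could be produced along the following lines. This is a minimal sketch, not the leaderboard's actual code: it assumes a percentile bootstrap over per-bundle scores, and the function name `bootstrap_mean_ci`, its parameters, and the fixed seed are all hypothetical.

```python
import random
from statistics import mean


def bootstrap_mean_ci(scores, n_iter=1000, alpha=0.05, seed=0):
    """Return (mean, lower, upper) for a list of per-bundle scores.

    Assumption: a percentile bootstrap; the page only says
    "1000-iter bootstrap" and does not name the variant.
    """
    rng = random.Random(seed)
    n = len(scores)
    # Resample bundles with replacement and record each resample's mean.
    boot_means = sorted(mean(rng.choices(scores, k=n)) for _ in range(n_iter))
    # Take the alpha/2 and 1 - alpha/2 percentiles of the resampled means.
    lo_idx = int(alpha / 2 * n_iter)
    hi_idx = min(n_iter - 1, int((1 - alpha / 2) * n_iter))
    return mean(scores), boot_means[lo_idx], boot_means[hi_idx]
```

A call such as `bootstrap_mean_ci([0.2, 0.4, 0.6, 0.8, 0.5, 0.7])` returns the sample mean together with the lower and upper CI bounds.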
Per-category overall (mean)
Cells are colour-coded by overall score: red < 0.4, amber 0.4–0.6, green ≥ 0.6.
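The thresholds above can be written as a small mapping. This sketch assumes amber covers the half-open range [0.4, 0.6), which follows from green starting at 0.6; the function name `score_colour` is hypothetical.

```python
def score_colour(score):
    """Map an overall score to its cell colour."""
    if score < 0.4:
        return "red"
    if score < 0.6:
        # Assumption: 0.4 is amber and 0.6 itself is green.
        return "amber"
    return "green"
```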
Pairwise differences (overall)
Paired-bundle bootstrap of A − B: a 95% CI that excludes 0 indicates a statistically reliable separation.
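A paired-bundle comparison of this kind might look like the sketch below. It assumes each submission's scores are held in a dict keyed by bundle id, that only bundles scored by both submissions are compared, and that the interval is a percentile bootstrap; `paired_bootstrap_diff` and its parameters are hypothetical names.

```python
import random
from statistics import mean


def paired_bootstrap_diff(scores_a, scores_b, n_iter=1000, alpha=0.05, seed=0):
    """Return (mean_diff, lower, upper, separated) for A - B.

    scores_a and scores_b map bundle id -> overall score.
    """
    # Pair the submissions on the bundles they have in common.
    common = sorted(set(scores_a) & set(scores_b))
    diffs = [scores_a[b] - scores_b[b] for b in common]
    rng = random.Random(seed)
    n = len(diffs)
    # Resample the per-bundle differences, which keeps the pairing intact.
    boot = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_iter))
    lower = boot[int(alpha / 2 * n_iter)]
    upper = boot[min(n_iter - 1, int((1 - alpha / 2) * n_iter))]
    # The separation counts as reliable only if the CI excludes 0.
    separated = lower > 0 or upper < 0
    return mean(diffs), lower, upper, separated
```

Resampling the differences rather than each submission's scores separately removes bundle-to-bundle difficulty from the comparison, which is the usual reason for pairing.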