WcodeW leaderboard

Leaderboard

Each row is one submission family (LLM agent or baseline), scored across all bundles for which it has data. Scores are reported as mean ± 95% CI from a 1000-iteration bootstrap. Click a row to expand the per-category breakdown.
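
As a rough illustration of how a mean ± 95% CI could be obtained from a 1000-iteration bootstrap, here is a minimal sketch. It assumes a percentile bootstrap that resamples per-item scores with replacement; the page does not say which bootstrap variant or resampling unit is used, and the function name `bootstrap_mean_ci` is hypothetical.

```python
import random
import statistics


def bootstrap_mean_ci(scores, n_iter=1000, alpha=0.05, seed=0):
    """Return (mean, ci_low, ci_high) for a list of scores.

    Assumption: percentile bootstrap over individual scores; the
    leaderboard's actual procedure may differ.
    """
    rng = random.Random(seed)  # fixed seed so results are reproducible
    n = len(scores)
    # Resample n scores with replacement, n_iter times, and record each mean.
    means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_iter)
    )
    # Pick the alpha/2 and 1 - alpha/2 percentiles of the resampled means.
    lo_idx = max(0, round((alpha / 2) * n_iter))
    hi_idx = min(n_iter - 1, round((1 - alpha / 2) * n_iter) - 1)
    return statistics.fmean(scores), means[lo_idx], means[hi_idx]


if __name__ == "__main__":
    example_scores = [0.2, 0.35, 0.5, 0.55, 0.6, 0.7, 0.8, 0.9]  # made-up data
    mean, lo, hi = bootstrap_mean_ci(example_scores)
    print(f"{mean:.3f} (95% CI {lo:.3f} to {hi:.3f})")
```

The percentile method is chosen here only because it is the simplest to read; other variants (for example BCa) would give slightly different intervals.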


# | Submission | N | overall ▼ | overall (indep judge) | overall ÷ ref | visual | dom | interaction | aj (opus) | aj (indep)

Per-category overall (mean)

Cells are colour-coded by overall score: red < 0.4, amber 0.4–0.6, green ≥ 0.6.
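
The colour thresholds above can be written as a small lookup. This is a sketch only: the function name `colour_for` is hypothetical, and it assumes the amber band is half-open (0.4 inclusive, 0.6 exclusive), which follows from green being stated as ≥ 0.6.

```python
def colour_for(score):
    """Map an overall score to the cell colour described in the legend."""
    if score < 0.4:
        return "red"
    if score < 0.6:  # 0.4 <= score < 0.6
        return "amber"
    return "green"  # score >= 0.6


if __name__ == "__main__":
    for s in (0.25, 0.4, 0.59, 0.6, 0.85):
        print(s, colour_for(s))
```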