Pilot bundles · pipeline scoreboard
Each card runs the pilot's reference implementation through the structural-invariant pipeline and lights up its score breakdown. Click a card for the per-layer invariants, NL assertions, and reference screenshots.
Scoring weights are pre-calibration placeholders; β can be 0 when reference screenshots are missing, and the natural-language verifier (γ) runs in stub-PASS mode here while the OpenRouter free tier is rate-limited. Cards marked DRAFT are pending team review of pilot annotator assignment / open design questions; see REPORT-12H §3.
Coverage matrix
Overall scores per (model, task). Cells are colour-tiered: ≥ 0.95 0.80–0.95 < 0.80 and — means no run yet.
Loading task scores…
Recent CI runs
Last 5 GitHub Actions runs across the smoke + Pages workflows. Click any row for the full job logs.