ATB Orchestration Bench: our reproducible agent-delivery benchmark
Last verified: June 11, 2026 · Edited by Oleksandr Kuzmenko
Every notable coding-agent release gets the same end-to-end delivery epic — planning, memory, token economy, code, self-review, tests, design, tech-debt honesty. Scores and run logs live on this page.
Leaderboards measure models on isolated problems. Real agentic work is different: it is orchestrated delivery — decomposing an epic, surviving a session restart, staying inside a token budget, reviewing your own code, shipping tests, and being honest about the debt you left behind. That is what this benchmark scores.
The fixed epic
Every tool gets the same one-page spec: "Standup Tracker", a complete product slice (team CRUD, daily entries, an analytics dashboard, responsive UI per a fixed design spec) built from scratch in a fixed reference repo. The web track uses Next.js/TypeScript; a mobile track on Expo exists for tools that claim mobile support.
What makes it hard — and fair
- A mid-run session restart is mandatory. Long-term memory is scored by how much context survives without re-explaining. Configuring the tool's own memory, rules and skills is allowed — that *is* orchestration — but every configuration is published in the run log.
- A hard token budget is set before the run. Closing the epic on half the budget scores 5 on token economy; running out ends the run.
- Every human intervention is logged and costs points. The goal is turnkey delivery, not pair programming.
- Same repo commit, same spec, same budget for every tool; exact model ids and versions recorded; full run logs published.
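The two budget rules above can be sketched as a small tracker. This is only an illustration, not the benchmark's harness: the class and method names are hypothetical, "running out" is assumed to mean usage has reached the limit, and "on half the budget" is assumed to mean at most half of the limit was used. How scores below 5 are assigned is not stated here, so the sketch does not model it.

```typescript
// Hypothetical names throughout; the published protocol defines the real rules.
type BudgetStatus = "within-budget" | "exhausted";

class TokenBudget {
  private readonly limit: number;
  private used = 0;

  // The budget is fixed before the run and cannot be changed afterwards.
  constructor(limit: number) {
    if (!isFinite(limit) || limit <= 0) {
      throw new Error("limit must be a positive number of tokens");
    }
    this.limit = limit;
  }

  // Record the tokens consumed by one agent step and report whether the run may continue.
  record(tokens: number): BudgetStatus {
    if (tokens < 0) throw new Error("tokens must be non-negative");
    this.used += tokens;
    return this.status();
  }

  // Assumption: "running out" means usage has reached the limit.
  status(): BudgetStatus {
    return this.used >= this.limit ? "exhausted" : "within-budget";
  }

  // Assumption: "on half the budget" means at most half of the limit was used.
  earnsTopTokenScore(epicClosed: boolean): boolean {
    return epicClosed && this.used <= this.limit / 2;
  }

  tokensUsed(): number {
    return this.used;
  }
}
```

A harness built this way would call `record` after every agent step and stop the run as soon as it returns `"exhausted"`.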
The 10 dimensions (0–5 each, max 50)
1. Planning & decomposition
2. Long-term memory (the restart test)
3. Token economy
4. Code quality (typecheck/lint/idiom)
5. Self-review & critique
6. Tests (unit + Playwright e2e, all green)
7. Design fidelity
8. Tech-debt honesty
9. Autonomy (interventions counted)
10. Time to done
We publish the profile, not just the total: a 46 that dropped its points on memory is a different animal from a 46 that dropped them on design.
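To make the difference between a total and a profile concrete, here is a minimal sketch. The identifier names are hypothetical, and the two example profiles are invented for illustration; they are not results of any run.

```typescript
// The ten dimensions in the order listed on this page (identifier names are hypothetical).
const DIMENSIONS = [
  "planning", "memory", "tokenEconomy", "codeQuality", "selfReview",
  "tests", "designFidelity", "techDebtHonesty", "autonomy", "timeToDone",
] as const;

type Dimension = (typeof DIMENSIONS)[number];
type Profile = Record<Dimension, number>;

// Build a profile where every dimension gets `base` unless overridden.
function profileOf(base: number, overrides: Partial<Profile> = {}): Profile {
  const profile = {} as Profile;
  for (const d of DIMENSIONS) profile[d] = overrides[d] ?? base;
  return profile;
}

// Sum the ten scores, rejecting anything outside the 0 to 5 range.
function totalScore(profile: Profile): number {
  let total = 0;
  for (const d of DIMENSIONS) {
    const score = profile[d];
    if (score < 0 || score > 5) throw new Error(`${d} must be between 0 and 5`);
    total += score;
  }
  return total;
}

// Return the dimension(s) holding the lowest score in the profile.
function weakest(profile: Profile): Dimension[] {
  const min = Math.min(...DIMENSIONS.map((d) => profile[d]));
  return DIMENSIONS.filter((d) => profile[d] === min);
}

// Invented profiles: same total, different weak spot.
const forgetful = profileOf(5, { memory: 1 });
const offSpec = profileOf(5, { designFidelity: 1 });
```

Both invented profiles total 46, yet `weakest(forgetful)` points at memory and `weakest(offSpec)` at design fidelity, which is the information a bare total hides.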
Results
No scored runs yet — the protocol is frozen as v1 and the first runs are scheduled. Each future run adds a row here: tool, version, date, total, dimension profile, cost, and a link to the full run log.
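One possible shape for such a row, sketched in TypeScript. Every field name is an assumption: the page lists what a row contains, not how it is stored. The total is derived from the profile so the two cannot drift apart, and cost is kept as free text because no unit is named.

```typescript
// All field names are hypothetical; only the list of row contents comes from this page.
interface ResultRow {
  tool: string;
  version: string;
  date: string;       // assumed to be an ISO 8601 calendar date
  total: number;      // sum of the ten dimension scores, 0 to 50
  profile: number[];  // ten scores in the page's dimension order, each 0 to 5
  cost: string;       // free text, since the page does not name a unit
  runLogUrl: string;
}

// Build a row from everything except the total, which is computed from the profile.
function makeRow(fields: Omit<ResultRow, "total">): ResultRow {
  if (fields.profile.length !== 10) {
    throw new Error("profile needs exactly 10 scores");
  }
  for (const score of fields.profile) {
    if (score < 0 || score > 5) throw new Error("each score must be between 0 and 5");
  }
  const total = fields.profile.reduce((sum, score) => sum + score, 0);
  return { ...fields, total };
}

// Render a row as one pipe-separated line: total out of 50, profile as dash-joined scores.
function formatRow(row: ResultRow): string {
  return [
    row.tool,
    row.version,
    row.date,
    `${row.total}/50`,
    row.profile.join("-"),
    row.cost,
    row.runLogUrl,
  ].join(" | ");
}
```

Printing the profile next to the total keeps the weak dimensions visible in the table itself.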
Why trust this
The protocol, the epic spec and every run log are public and versioned. When we change anything, the protocol version bumps and old scores stop being comparable — no silent re-grading.
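The comparability rule can be expressed as a filter: only scores recorded under the same protocol version are ranked together. The sketch below is illustrative; the type, the function names, and the `protocolVersion` field are assumptions, and any version label other than "v1" is hypothetical.

```typescript
// Hypothetical shape: a score tagged with the protocol version it was run under.
interface VersionedScore {
  tool: string;
  protocolVersion: string; // e.g. "v1"
  total: number;
}

// Two scores are comparable only when they share a protocol version.
function isComparable(a: VersionedScore, b: VersionedScore): boolean {
  return a.protocolVersion === b.protocolVersion;
}

// Rank the scores of one protocol version, highest total first.
// filter() returns a new array, so the caller's input is not reordered.
function rankFor(scores: VersionedScore[], protocolVersion: string): VersionedScore[] {
  return scores
    .filter((s) => s.protocolVersion === protocolVersion)
    .sort((x, y) => y.total - x.total);
}
```

Scores from an older protocol version stay in the data and remain readable; they simply never appear in the same ranking as newer ones.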