CursorBench 3.1 evaluates cost and efficiency of elite agentic coding models
Cursor published CursorBench 3.1, comparing leading LLMs across complex codebase editing and planning tasks. The results show wide variation in API cost per task, from $1.94 to $13.74 among the models reported here, along with large differences in token usage.
Impact: Medium
Why it matters
Understanding the cost-to-performance trade-off of different model tiers helps companies optimize their spending on automated coding agents.
TL;DR
- Fable 5 Extra High achieved the highest score, 72.0%, but at a high average cost of $13.74 per task.
- Gemini 3.5 Flash is highly economical at $1.94 per task, though with a lower score of 49.8%.
- Small score differences between models may not be statistically significant due to execution variance.
Key facts
- Fable 5 Extra High Score: 72.0%
- Fable 5 Extra High Cost/Task: $13.74
- GPT-5.5 Extra High Score: 64.3%
- GPT-5.5 Extra High Cost/Task: $4.37
Codebase Editing and Planning Focus
CursorBench 3.1 features updates focused on deep codebase understanding, bug finding, planning, and code review. It also improves the grading criteria for edit tasks and expands on an initial set of tasks that targeted edit, refactor, and bugfix problems.
The Cost of Multi-Step Execution
The benchmark data shows large cost differences across models on identical tasks, and a higher cost does not always buy a higher score:
- Fable 5 Extra High: 72.0% score | $13.74 average cost per task | 48,754 tokens
- GPT-5.5 Extra High: 64.3% score | $4.37 average cost per task | 17,905 tokens
- Sonnet 5 Extra High: 58.4% score | $5.23 average cost per task | 58,228 tokens
- Gemini 3.5 Flash: 49.8% score | $1.94 average cost per task | 35,105 tokens
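One way to read these figures side by side is to divide average cost by score and to check whether any model is beaten on both axes at once. The sketch below does that using only the four rows listed above; the names `cost_per_point` and `dominated` are illustrative helpers, not part of CursorBench, and "dollars per score point" is an assumed convenience metric rather than one the benchmark reports.

```python
# Scores (%) and average cost per task (USD), copied from the list above.
results = {
    "Fable 5 Extra High": (72.0, 13.74),
    "GPT-5.5 Extra High": (64.3, 4.37),
    "Sonnet 5 Extra High": (58.4, 5.23),
    "Gemini 3.5 Flash": (49.8, 1.94),
}


def cost_per_point(score: float, cost: float) -> float:
    """Dollars spent per percentage point of score (illustrative metric)."""
    return cost / score


def dominated(name: str) -> bool:
    """True if another model scores at least as high for no more money,
    and is strictly better on at least one of the two."""
    score, cost = results[name]
    for other, (o_score, o_cost) in results.items():
        if other == name:
            continue
        if o_score >= score and o_cost <= cost and (o_score > score or o_cost < cost):
            return True
    return False


# Cheapest per score point first.
ranked = sorted(results, key=lambda n: cost_per_point(*results[n]))

for model in ranked:
    score, cost = results[model]
    flag = " (dominated)" if dominated(model) else ""
    print(f"{model}: ${cost_per_point(score, cost):.3f} per point{flag}")
```

On these four rows, Sonnet 5 Extra High is the only model flagged as dominated, since GPT-5.5 Extra High scores higher at a lower average cost. Given the variance caveat, a ranking like this is best treated as a rough guide rather than a precise ordering.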
Computing Average Costs
The average cost per task is computed by applying each model's published pricing (including input, cache read, cache write, and output) to the tokens used on each task. Because these results are subject to variance, small differences in scores may not be statistically meaningful.
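The calculation described above can be sketched as follows. This is a minimal illustration, not Cursor's actual implementation: the `Pricing` and `Usage` types, the per-million-token pricing convention, and every number in the example are assumptions made for the sketch and do not correspond to any model's published rates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Hypothetical prices in USD per one million tokens."""
    input: float
    cache_read: float
    cache_write: float
    output: float


@dataclass(frozen=True)
class Usage:
    """Token counts consumed by one task, split by billing category."""
    input: int
    cache_read: int
    cache_write: int
    output: int


def task_cost(usage: Usage, pricing: Pricing) -> float:
    """Apply each category's price to the tokens used in that category."""
    total = (
        usage.input * pricing.input
        + usage.cache_read * pricing.cache_read
        + usage.cache_write * pricing.cache_write
        + usage.output * pricing.output
    )
    return total / 1_000_000


def average_cost(usages: list[Usage], pricing: Pricing) -> float:
    """Mean cost per task over a non-empty list of per-task usages."""
    if not usages:
        raise ValueError("at least one task is required")
    return sum(task_cost(u, pricing) for u in usages) / len(usages)


# Made-up prices and token counts, for illustration only.
example_pricing = Pricing(input=2.0, cache_read=0.2, cache_write=2.5, output=10.0)
example_tasks = [
    Usage(input=100_000, cache_read=400_000, cache_write=50_000, output=20_000),
    Usage(input=30_000, cache_read=900_000, cache_write=10_000, output=5_000),
]
print(f"${average_cost(example_tasks, example_pricing):.2f} per task")
```

Keeping the four categories separate matters because they are priced differently, so two runs with similar total token counts can end up with quite different costs.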