CursorBench 3.1 evaluates cost and efficiency of elite agentic coding models

Models & research

July 2, 2026 5 min read

Curated by Oleksandr Kuzmenko, AI Product Engineer · Updated July 2, 2026

AI-assisted · editor-reviewed

Cursor published CursorBench 3.1, comparing leading LLMs across complex codebase editing and planning tasks. The results show large differences between models in per-task API cost and token usage.

Impact: Medium

Why it matters

Knowing the cost-to-performance ratio of different model tiers helps companies decide how much to spend on automated coding agents.

TL;DR

01 Fable 5 Extra High achieved the highest score of 72.0% but with a high average cost of $13.74 per task.
02 Gemini 3.5 Flash is highly economical at $1.94 per task, though with a lower score of 49.8%.
03 Small score differences between models may not be statistically significant due to execution variance.

Key facts

Fable 5 Extra High Score: 72.0%
Fable 5 Extra High Cost/Task: $13.74
GPT-5.5 Extra High Score: 64.3%
GPT-5.5 Extra High Cost/Task: $4.37

Codebase Editing and Planning Focus

CursorBench 3.1 features updates focused on deep codebase understanding, bug finding, planning, and code review. It also improves the grading criteria for edit tasks, expanding on an initial set of tasks that targeted edit, refactor, and bugfix problems.

The Cost of Multi-Step Execution

On identical tasks, the average cost per task ranges from $1.94 to $13.74 across the four models listed:

Fable 5 Extra High: 72.0% score | $13.74 average cost per task | 48,754 tokens
GPT-5.5 Extra High: 64.3% score | $4.37 average cost per task | 17,905 tokens
Sonnet 5 Extra High: 58.4% score | $5.23 average cost per task | 58,228 tokens
Gemini 3.5 Flash: 49.8% score | $1.94 average cost per task | 35,105 tokens
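
One way to read these rows is as dollars spent per percentage point of score. The sketch below uses only the four score and cost figures listed above; the `cost_per_point` metric and its name are illustrative assumptions, not something CursorBench reports.

```python
# Figures copied from the rows above: (score in %, average cost per task in USD).
results = {
    "Fable 5 Extra High": (72.0, 13.74),
    "GPT-5.5 Extra High": (64.3, 4.37),
    "Sonnet 5 Extra High": (58.4, 5.23),
    "Gemini 3.5 Flash": (49.8, 1.94),
}


def cost_per_point(score_pct: float, cost_usd: float) -> float:
    """USD per percentage point of score (illustrative metric, not from the benchmark)."""
    return cost_usd / score_pct


# Rank models from cheapest to most expensive per score point.
ranked = sorted(results, key=lambda name: cost_per_point(*results[name]))

for name in ranked:
    score, cost = results[name]
    print(f"{name}: ${cost_per_point(score, cost):.3f} per point")
```

By this measure Gemini 3.5 Flash is the cheapest per point and Fable 5 Extra High the most expensive. A single ratio like this ignores whether the extra points matter for a given workload, so it is a starting point for comparison rather than a ranking.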

Computing Average Costs

The average cost per task is computed by applying each model's published pricing (including input, cache read, cache write, and output) to the tokens used on each task. Because these results are subject to variance, small differences in scores may not be statistically meaningful.
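
The calculation described above can be sketched roughly as follows. The four token categories come from the description; the class names, the per-million-token unit, and every price and token count in the example are placeholder assumptions, not any vendor's published pricing or CursorBench data.

```python
from dataclasses import dataclass


@dataclass
class TokenUsage:
    """Tokens consumed on one task, split by billing category."""
    input: int
    cache_read: int
    cache_write: int
    output: int


@dataclass
class Pricing:
    """USD per one million tokens (assumed unit; values below are placeholders)."""
    input: float
    cache_read: float
    cache_write: float
    output: float


def task_cost(usage: TokenUsage, pricing: Pricing) -> float:
    """Apply each category's price to the tokens used on a single task."""
    total = (
        usage.input * pricing.input
        + usage.cache_read * pricing.cache_read
        + usage.cache_write * pricing.cache_write
        + usage.output * pricing.output
    )
    return total / 1_000_000


def average_cost(usages: list[TokenUsage], pricing: Pricing) -> float:
    """Mean cost per task across all tasks in the run."""
    return sum(task_cost(u, pricing) for u in usages) / len(usages)


# Hypothetical prices and usage, for illustration only.
example_pricing = Pricing(input=2.0, cache_read=0.2, cache_write=2.5, output=10.0)
example_usages = [
    TokenUsage(input=100_000, cache_read=400_000, cache_write=50_000, output=10_000),
    TokenUsage(input=200_000, cache_read=0, cache_write=0, output=20_000),
]

print(f"average cost per task: ${average_cost(example_usages, example_pricing):.4f}")
```

Because each category is priced separately, two models with similar total token counts can end up with very different costs per task depending on how much of the traffic is cached input versus output.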

#CursorBench #Gemini 3.5 Flash #GPT-5.5 #Fable 5 #Sonnet 5

