AI Today Brief

The daily AI-engineering brief. Built in public. EN · UA.


Models & research

CursorBench 3.1 evaluates cost and efficiency of elite agentic coding models

July 2, 2026 · 5 min read
Curated by Oleksandr Kuzmenko, AI Product Engineer · Updated July 2, 2026 · Sources cited on every story
AI-assisted · editor-reviewed

Cursor published CursorBench 3.1, comparing leading LLMs across complex codebase editing and planning tasks. The data reveals wide variation in real-world API token costs and execution steps: among the models listed here, average cost per task ranges from $1.94 to $13.74.

Impact: Medium

Why it matters

Understanding the exact cost-to-performance ratio of different model tiers helps companies optimize their spending on automated coding agents.

TL;DR

  • Fable 5 Extra High achieved the highest score of 72.0% but with a high average cost of $13.74 per task.
  • Gemini 3.5 Flash is highly economical at $1.94 per task, though with a lower score of 49.8%.
  • Small score differences between models may not be statistically significant due to execution variance.

Key facts

  • Fable 5 Extra High Score: 72.0%
  • Fable 5 Extra High Cost/Task: $13.74
  • GPT-5.5 Extra High Score: 64.3%
  • GPT-5.5 Extra High Cost/Task: $4.37

Codebase Editing and Planning Focus

CursorBench 3.1 adds updates focused on deep codebase understanding, bug finding, planning, and code review. It also improves the grading criteria for edit tasks, building on an initial task set that targeted edit, refactor, and bugfix problems.

The Cost of Multi-Step Execution

The benchmark data shows cost discrepancies across models for identical tasks:

  • Fable 5 Extra High: 72.0% score | $13.74 average cost per task | 48,754 tokens
  • GPT-5.5 Extra High: 64.3% score | $4.37 average cost per task | 17,905 tokens
  • Sonnet 5 Extra High: 58.4% score | $5.23 average cost per task | 58,228 tokens
  • Gemini 3.5 Flash: 49.8% score | $1.94 average cost per task | 35,105 tokens
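
One way to read the figures above is as dollars spent per percentage point of score. The sketch below is a minimal illustration that uses only the four rows listed. The `cost_per_point` metric and all variable names are assumptions made for this example and are not part of CursorBench.

```python
# Score (%) and average cost per task (USD), copied from the list above.
results = {
    "Fable 5 Extra High": (72.0, 13.74),
    "GPT-5.5 Extra High": (64.3, 4.37),
    "Sonnet 5 Extra High": (58.4, 5.23),
    "Gemini 3.5 Flash": (49.8, 1.94),
}


def cost_per_point(score: float, cost: float) -> float:
    """Dollars per percentage point of score (an illustrative metric, not an official one)."""
    return cost / score


# Sort models from cheapest to most expensive per point of score.
ranked = sorted(results.items(), key=lambda item: cost_per_point(*item[1]))

for name, (score, cost) in ranked:
    print(f"{name}: ${cost_per_point(score, cost):.3f} per point ({score}% at ${cost}/task)")
```

On these four rows, Gemini 3.5 Flash is the cheapest per point and Fable 5 Extra High the most expensive. The metric treats every percentage point as equally valuable, which may not hold if the hardest tasks are the ones that matter most to a team.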

Computing Average Costs

The average cost per task is computed by applying each model's published pricing (including input, cache read, cache write, and output) to the tokens used on each task. Because these results are subject to variance, small differences in scores may not be statistically meaningful.
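
The calculation described above could be sketched roughly as follows. The prices and token counts here are made-up placeholders, not any model's published pricing or CursorBench data, and the names `Pricing`, `TokenUsage`, `task_cost`, and `average_cost` are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Pricing:
    """USD per one million tokens for each token category (placeholder values only)."""
    input: float
    cache_read: float
    cache_write: float
    output: float


@dataclass(frozen=True)
class TokenUsage:
    """Tokens consumed on a single task, split by billing category."""
    input: int
    cache_read: int
    cache_write: int
    output: int


def task_cost(usage: TokenUsage, pricing: Pricing) -> float:
    """Apply per-category pricing to the tokens used on one task."""
    total = (
        usage.input * pricing.input
        + usage.cache_read * pricing.cache_read
        + usage.cache_write * pricing.cache_write
        + usage.output * pricing.output
    )
    return total / 1_000_000  # prices are quoted per million tokens


def average_cost(usages: list[TokenUsage], pricing: Pricing) -> float:
    """Average the per-task costs across all tasks."""
    return mean(task_cost(u, pricing) for u in usages)


# Hypothetical pricing and two hypothetical tasks, for illustration only.
pricing = Pricing(input=2.0, cache_read=0.5, cache_write=2.5, output=10.0)
usages = [
    TokenUsage(input=100_000, cache_read=400_000, cache_write=50_000, output=20_000),
    TokenUsage(input=200_000, cache_read=800_000, cache_write=100_000, output=40_000),
]
print(f"Average cost per task: ${average_cost(usages, pricing):.4f}")
```

Because each category is priced separately, two models with similar total token counts can end up with quite different costs, depending on how much of the traffic is cached input versus output.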
