SWE-bench-Verified score reaches 85%
“We forecast that mid-2025 agents will score 85% on SWEBench-Verified.” (AI 2027, page 3, footnote 3. Note: the 72% starting point is tracker-added context, not from the source.)
What AI 2027 Predicted
The AI 2027 scenario implied that SWE-bench-Verified scores would reach 85% by mid-2025, starting from a baseline of roughly 72% at the time of writing. This was part of the broader narrative arc toward fully automated coding by early 2027 — SWE-bench progress was treated as a leading indicator of when AI would surpass human software engineers on standard tasks.
How We Track This
We monitor (see the aggregation sketch after this list):
- Official SWE-bench-Verified leaderboard scores
- Epoch AI’s standardized SWE-bench-Verified evaluations (v2.0.0 methodology as of Feb 2026)
- Individual model announcements from Anthropic, OpenAI, Google, and others
- Scale AI’s SWE-Bench Pro results for harder task subsets
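Because these sources do not share a single unified API, a minimal sketch of how such tracking can be aggregated might look like the following. It assumes a hand-maintained `scores.csv`; the filename, column names, and sample rows are hypothetical, and vendor-reported scores are kept separate from independent runs since the two regularly disagree:

```python
# track_swebench.py — aggregate SWE-bench-Verified scores from a hand-kept CSV.
# Hypothetical scores.csv format:
#   date,model,source,score_pct
#   2025-08-05,Claude Opus 4.1,vendor,74.5
#   2025-11-24,Claude Opus 4.5,vendor,80.9
import csv
from collections import defaultdict

TARGET = 85.0  # the AI 2027 mid-2025 threshold


def load_scores(path="scores.csv"):
    """Read rows and coerce the score column to float."""
    with open(path, newline="") as f:
        return [
            {**row, "score_pct": float(row["score_pct"])}
            for row in csv.DictReader(f)
        ]


def best_by_source(rows):
    """Best score per source; vendor and independent (e.g. Epoch) runs are
    tracked separately because standardized evaluations tend to come in lower
    than vendor claims."""
    best = defaultdict(float)
    for row in rows:
        best[row["source"]] = max(best[row["source"]], row["score_pct"])
    return dict(best)


if __name__ == "__main__":
    for source, score in sorted(best_by_source(load_scores()).items()):
        print(f"{source:12s} best={score:5.1f}%  gap to 85%: {TARGET - score:+.1f} pp")
```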
Current Evidence
Progress on SWE-bench-Verified has been significantly slower than the scenario predicted. The AI Futures Project’s own grading (Feb 2026) put the best cleanly comparable score at 74.5% (Claude Opus 4.1, without extended thinking), well short of the 85% target for mid-2025; even the highest lab-reported score at that point (80.9%, Claude Opus 4.5) fell short of the target.
In February 2026, Epoch AI shipped a major methodology upgrade (v2.0.0) that improved scores through better scaffolding and token limits, but top models still cluster around 70–75% on the standardized evaluation. Meanwhile, Scale AI’s harder SWE-Bench Pro showed even top models (GPT-5, Claude Opus 4.1) at only ~23% at its September 2025 launch (the best models had reached ~57% by March 2026), highlighting the gap between benchmark performance and real-world difficulty.
The AI Futures authors noted that SWE-bench progress was “surprisingly slow” relative to the scenario’s predictions.
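That judgment is easy to sanity-check with a back-of-envelope linear extrapolation over the dated scores in the update history below. This is a naive sketch, not the AI Futures Project’s method: it uses only the first and last points, and days are approximated where the table gives only a month:

```python
from datetime import date

# Dated top scores taken from the update history below; intermediate points
# are listed for context but the pace uses only the endpoints.
observations = [
    (date(2025, 2, 24), 70.3),   # Claude 3.7 Sonnet (February)
    (date(2025, 5, 22), 72.5),   # Claude Opus 4 (May 22)
    (date(2025, 8, 5), 74.5),    # Claude Opus 4.1 (August 5)
    (date(2025, 11, 24), 80.9),  # Claude Opus 4.5 (November, day approximate)
]

TARGET = 85.0
first_day, first_score = observations[0]
last_day, last_score = observations[-1]

# Percentage points gained per 30-day month, endpoint to endpoint.
pace = (last_score - first_score) / ((last_day - first_day).days / 30.0)
months_left = (TARGET - last_score) / pace
print(f"average pace ≈ {pace:.2f} pp/month; at that rate the {TARGET}% bar "
      f"is ~{months_left:.1f} months past {last_day}")  # ≈ 1.16 pp/month, ~Q1 2026
```

On this crude trend the 85% bar lands in early 2026, roughly nine months behind the scenario, which is consistent with the “surprisingly slow” verdict.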
Sources:
- Grading AI 2027’s 2025 Predictions — AI Futures Project
- SWE-bench Verified — Epoch AI
- SWE-Bench Pro Leaderboard — Scale Labs
Counterevidence & Limitations
- SWE-bench-Verified methodology changes (Epoch v2.0.0) make direct historical comparisons tricky — some earlier scores used different scaffolding
- Lab-reported scores sometimes differ from independent evaluations; Epoch’s standardized runs tend to show lower numbers than vendor claims
- The benchmark may be approaching saturation on easier tasks, making incremental gains harder
- Real-world coding agent performance (e.g., Claude Code generating $500M+ run-rate revenue) arguably outpaces what the benchmark captures
- Some argue SWE-bench-Verified is no longer the right proxy for coding capability — harder benchmarks like SWE-Bench Pro may be more informative
What Would Change Our Assessment
- Upgrade to “on-track”: A credible, independently verified score above 80% on SWE-bench-Verified
- Upgrade to “confirmed”: Score reaches or exceeds 85% on the standardized Epoch evaluation
- Maintain “behind”: Scores remain in the 70–78% range through mid-2026 (these rules are codified in the sketch below)
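For concreteness, the three rules reduce to a few lines of code. This is a sketch only: the function name `assess` and the boolean verification flag are our framing, while the status labels mirror the tracker’s own vocabulary:

```python
def assess(best_score_pct: float, independently_verified: bool) -> str:
    """Map the best SWE-bench-Verified score to a tracker status."""
    if independently_verified and best_score_pct >= 85.0:
        return "confirmed"  # meets the 85% bar on the standardized eval
    if independently_verified and best_score_pct > 80.0:
        return "on-track"   # credible, verified score above 80%
    return "behind"         # e.g. scores stuck in the 70-78% range

assert assess(74.5, True) == "behind"
assert assess(80.9, True) == "on-track"
assert assess(86.0, True) == "confirmed"
```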
Update History
| Date | Update |
|---|---|
| 2025-05 | Claude Opus 4 scores 72.5% on SWE-bench Verified (May 22). Compared to Claude 3.7 Sonnet’s 70.3% in February, pace is slower than needed to reach 85% by mid-2025. The prediction is tracking behind schedule. |
| 2025-06 | AI 2027 predicted 85% SWE-bench-Verified by mid-2025. Best score at that time was ~72%, establishing a gap. |
| 2025-08 | Claude Opus 4.1 reaches 74.5% (August 5). The mid-2025 target of 85% appears approximately 6 months behind pace. No lab has publicly disclosed a score above 75% at this point. |
| 2025-09 | Claude Sonnet 4.5 reaches 77.2% (September 29) — progress is real but slow. Scale AI’s SWE-Bench Pro (September 19) shows models dropping to ~23% on harder long-horizon tasks, raising questions about whether the original benchmark’s 85% threshold remains a meaningful milestone. |
| 2025-11 | Rapid convergence: GPT-5.1-Codex-Max hits 77.9%, Gemini 3 hits 76.2%, and Claude Opus 4.5 hits 80.9% — the first model above 80% — all within 6 days. The 85% target is now clearly achievable in early 2026, roughly 6–9 months late relative to the AI 2027 prediction. |
| 2025-12 | AI Futures Project Dec 2025 model update assessed SWE-bench trajectory as behind pace. The best non-extended-thinking score (74.5%, from Claude Opus 4.1) was used as the cleanest comparable. Status: behind (confidence: 0.45). |
| 2026-01 | AI Futures Project clarification post confirmed SWE-bench progress was one of the primary factors in revised (longer) timelines. Daniel Kokotajlo’s median for full coding automation: December 2030. |
| 2026-02 | AI Futures grading confirmed: best SWE-bench Verified score 74.5% (Opus 4.1 without extended thinking), “surprisingly slow” vs. 85% mid-2025 target. Epoch AI methodology upgrade (v2.0.0) improved scaffolding but top models cluster 70–75% on standardized eval. |
| 2026-03 | Best cleanly comparable score remains ~74.5% (Opus 4.1, without extended thinking), significantly behind the 85% target. Progress has been slower than predicted despite substantial investment in coding agent capabilities. |
| 2026-03-16 | New benchmarks show Opus 4.6 at 80.8% on SWE-bench Verified and Gemini 3.1 Pro at ~80.6%. Gap to 85% target closing but still not met, now ~11 months past the predicted mid-2025 deadline. Confidence adjusted 0.80 → 0.85 as 85% appears reachable within months. |
| 2026-03-23 | Multiple sources confirm top SWE-bench Verified scores: Claude Opus 4.5 at ~80.9%, Opus 4.6 at ~80.8%, MiniMax M2.5 at 80.2% (Morphllm, llm-stats). However, METR study (Mar 10) found ~50% of SWE-bench-passing PRs would not actually be merged by maintainers, suggesting the benchmark overstates real-world coding capability (METR). Scale AI’s harder SWE-Bench Pro shows best models at only ~57% (GPT-5.4). The 85% SWE-bench Verified target appears reachable but its significance as a proxy for coding capability is increasingly questioned. No status or confidence change. |
| 2026-03-30 | GPT-5.4 released March 5 (Thinking and Pro variants) — METR SWE-bench evaluation pending. Unverified DeepSeek V4 leaked benchmarks claim 81% SWE-bench but have not been independently confirmed (nxcode.io). Anthropic’s leaked Mythos/Capybara announcement states it “gets dramatically higher scores on tests of software coding” vs. Opus 4.6 — suggesting the 85% target may be within reach of an imminent release. The 85% bar now appears plausibly achievable within Q2 2026, ~12 months past the original mid-2025 deadline. No status change; confidence unchanged, as the target remains unmet by a verified, publicly released model. |