SWE-bench-Verified score reaches 85%
“We forecast that mid-2025 agents will score 85% on SWEBench-Verified.” (AI 2027, page 3, footnote 3. Note: the 72% starting point is tracker-added context, not from the source.)
What AI 2027 Predicted
The AI 2027 scenario implied that SWE-bench-Verified scores would reach 85% by mid-2025, starting from a baseline of roughly 72% at the time of writing. This was part of the broader narrative arc toward fully automated coding by early 2027 — SWE-bench progress was treated as a leading indicator of when AI would surpass human software engineers on standard tasks.
How We Track This
We monitor (see the aggregation sketch after this list):
- Official SWE-bench-Verified leaderboard scores
- Epoch AI’s standardized SWE-bench-Verified evaluations (v2.0.0 methodology as of Feb 2026)
- Individual model announcements from Anthropic, OpenAI, Google, and others
- Scale AI’s SWE-Bench Pro results for harder task subsets
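Because these sources do not share a single unified API, a minimal sketch of how such tracking can be aggregated might look like the following. It assumes a hand-maintained `scores.csv`; the filename, column names, and sample rows are hypothetical, and vendor-reported scores are kept separate from independent runs since the two regularly disagree:

```python
# track_swebench.py — aggregate SWE-bench-Verified scores from a hand-kept CSV.
# Hypothetical scores.csv format:
#   date,model,source,score_pct
#   2025-08-05,Claude Opus 4.1,vendor,74.5
#   2025-11-24,Claude Opus 4.5,vendor,80.9
import csv
from collections import defaultdict

TARGET = 85.0  # the AI 2027 mid-2025 threshold


def load_scores(path="scores.csv"):
    """Read rows and coerce the score column to float."""
    with open(path, newline="") as f:
        return [
            {**row, "score_pct": float(row["score_pct"])}
            for row in csv.DictReader(f)
        ]


def best_by_source(rows):
    """Best score per source; vendor and independent (e.g. Epoch) runs are
    tracked separately because standardized evaluations tend to come in lower
    than vendor claims."""
    best = defaultdict(float)
    for row in rows:
        best[row["source"]] = max(best[row["source"]], row["score_pct"])
    return dict(best)


if __name__ == "__main__":
    for source, score in sorted(best_by_source(load_scores()).items()):
        print(f"{source:12s} best={score:5.1f}%  gap to 85%: {TARGET - score:+.1f} pp")
```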
Current Evidence
Progress on SWE-bench-Verified has been significantly slower than the scenario predicted. The AI Futures Project’s own grading (Feb 2026) put the best cleanly comparable score at 74.5% (Claude Opus 4.1, without extended thinking), well short of the 85% target for mid-2025; even the highest lab-reported score at that point (80.9%, Claude Opus 4.5) fell short of the target.
In February 2026, Epoch AI shipped a major methodology upgrade (v2.0.0) that improved scores through better scaffolding and token limits, but top models still cluster around 70–75% on the standardized evaluation. Meanwhile, Scale AI’s harder SWE-Bench Pro showed even top models (GPT-5, Claude Opus 4.1) at only ~23% at its September 2025 launch (the best models had reached ~57% by March 2026), highlighting the gap between benchmark performance and real-world difficulty.
The AI Futures authors noted that SWE-bench progress was “surprisingly slow” relative to the scenario’s predictions.
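That judgment is easy to sanity-check with a back-of-envelope linear extrapolation over the dated scores in the update history below. This is a naive sketch, not the AI Futures Project’s method: it uses only the first and last points, and days are approximated where the table gives only a month:

```python
from datetime import date

# Dated top scores taken from the update history below; intermediate points
# are listed for context but the pace uses only the endpoints.
observations = [
    (date(2025, 2, 24), 70.3),   # Claude 3.7 Sonnet (February)
    (date(2025, 5, 22), 72.5),   # Claude Opus 4 (May 22)
    (date(2025, 8, 5), 74.5),    # Claude Opus 4.1 (August 5)
    (date(2025, 11, 24), 80.9),  # Claude Opus 4.5 (November, day approximate)
]

TARGET = 85.0
first_day, first_score = observations[0]
last_day, last_score = observations[-1]

# Percentage points gained per 30-day month, endpoint to endpoint.
pace = (last_score - first_score) / ((last_day - first_day).days / 30.0)
months_left = (TARGET - last_score) / pace
print(f"average pace ≈ {pace:.2f} pp/month; at that rate the {TARGET}% bar "
      f"is ~{months_left:.1f} months past {last_day}")  # ≈ 1.16 pp/month, ~Q1 2026
```

On this crude trend the 85% bar lands in early 2026, roughly nine months behind the scenario, which is consistent with the “surprisingly slow” verdict.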
Sources:
- Grading AI 2027’s 2025 Predictions — AI Futures Project
- SWE-bench Verified — Epoch AI
- SWE-Bench Pro Leaderboard — Scale Labs
Counterevidence & Limitations
- SWE-bench-Verified methodology changes (Epoch v2.0.0) make direct historical comparisons tricky — some earlier scores used different scaffolding
- Lab-reported scores sometimes differ from independent evaluations; Epoch’s standardized runs tend to show lower numbers than vendor claims
- The benchmark may be approaching saturation on easier tasks, making incremental gains harder
- Real-world coding agent performance (e.g., Claude Code generating $500M+ run-rate revenue) arguably outpaces what the benchmark captures
- Some argue SWE-bench-Verified is no longer the right proxy for coding capability — harder benchmarks like SWE-Bench Pro may be more informative
What Would Change Our Assessment
- Upgrade to “on-track”: A credible, independently verified score above 80% on SWE-bench-Verified
- Upgrade to “confirmed”: Score reaches or exceeds 85% on the standardized Epoch evaluation
- Maintain “behind”: Scores remain in the 70–78% range through mid-2026 (these rules are codified in the sketch below)
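For concreteness, the three rules reduce to a few lines of code. This is a sketch only: the function name `assess` and the boolean verification flag are our framing, while the status labels mirror the tracker’s own vocabulary:

```python
def assess(best_score_pct: float, independently_verified: bool) -> str:
    """Map the best SWE-bench-Verified score to a tracker status."""
    if independently_verified and best_score_pct >= 85.0:
        return "confirmed"  # meets the 85% bar on the standardized eval
    if independently_verified and best_score_pct > 80.0:
        return "on-track"   # credible, verified score above 80%
    return "behind"         # e.g. scores stuck in the 70-78% range

assert assess(74.5, True) == "behind"
assert assess(80.9, True) == "on-track"
assert assess(86.0, True) == "confirmed"
```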
Update History
| Date | Update |
|---|---|
| 2025-05 | Claude Opus 4 scores 72.5% on SWE-bench Verified (May 22). Compared to Claude 3.7 Sonnet’s 70.3% in February, pace is slower than needed to reach 85% by mid-2025. The prediction is tracking behind schedule. |
| 2025-06 | AI 2027 predicted 85% SWE-bench-Verified by mid-2025. Best score at that time was ~72%, establishing a gap. |
| 2025-08 | Claude Opus 4.1 reaches 74.5% (August 5). The mid-2025 target of 85% appears approximately 6 months behind pace. No lab has publicly disclosed a score above 75% at this point. |
| 2025-09 | Claude Sonnet 4.5 reaches 77.2% (September 29) — progress is real but slow. Scale AI’s SWE-Bench Pro (September 19) shows models dropping to ~23% on harder long-horizon tasks, raising questions about whether the original benchmark’s 85% threshold remains a meaningful milestone. |
| 2025-11 | Rapid convergence: GPT-5.1-Codex-Max hits 77.9%, Gemini 3 hits 76.2%, and Claude Opus 4.5 hits 80.9% — the first model above 80% — all within 6 days. The 85% target is now clearly achievable in early 2026, roughly 6–9 months late relative to the AI 2027 prediction. |
| 2025-12 | AI Futures Project Dec 2025 model update assessed SWE-bench trajectory as behind pace. The best non-extended-thinking score (74.5%, from Claude Opus 4.1) was used as the cleanest comparable. Status: behind (confidence: 0.45). |
| 2026-01 | AI Futures Project clarification post confirmed SWE-bench progress was one of the primary factors in revised (longer) timelines. Daniel Kokotajlo’s median for full coding automation: December 2030. |
| 2026-02 | AI Futures grading confirmed: best SWE-bench Verified score 74.5% (Opus 4.1 without extended thinking), “surprisingly slow” vs. 85% mid-2025 target. Epoch AI methodology upgrade (v2.0.0) improved scaffolding but top models cluster 70–75% on standardized eval. |
| 2026-03 | Best cleanly comparable score remains ~74.5% (Opus 4.1, without extended thinking), significantly behind the 85% target. Progress has been slower than predicted despite substantial investment in coding agent capabilities. |
| 2026-03-16 | New benchmarks show Opus 4.6 at 80.8% on SWE-bench Verified and Gemini 3.1 Pro at ~80.6%. Gap to 85% target closing but still not met, now ~11 months past the predicted mid-2025 deadline. Confidence adjusted 0.80 → 0.85 as 85% appears reachable within months. |
| 2026-03-23 | Multiple sources confirm top SWE-bench Verified scores: Claude Opus 4.5 at ~80.9%, Opus 4.6 at ~80.8%, MiniMax M2.5 at 80.2% (Morphllm, llm-stats). However, METR study (Mar 10) found ~50% of SWE-bench-passing PRs would not actually be merged by maintainers, suggesting the benchmark overstates real-world coding capability (METR). Scale AI’s harder SWE-Bench Pro shows best models at only ~57% (GPT-5.4). The 85% SWE-bench Verified target appears reachable but its significance as a proxy for coding capability is increasingly questioned. No status or confidence change. |
| 2026-03-30 | GPT-5.4 released March 5 (Thinking and Pro variants) — METR SWE-bench evaluation pending. Unverified DeepSeek V4 leaked benchmarks claim 81% SWE-bench but have not been independently confirmed (nxcode.io). Anthropic’s leaked Mythos/Capybara announcement states it “gets dramatically higher scores on tests of software coding” vs. Opus 4.6 — suggesting the 85% target may be within reach of an imminent release. The 85% bar now appears plausibly achievable within Q2 2026, ~12 months past the original mid-2025 deadline. No status change; confidence unchanged, as the target remains unmet by a verified, publicly released model. |