RE-Bench score reaches 1.3
A score of 1.3 on RE-Bench, matching top expert humans given 8 hours on well-defined AI research engineering tasks.
What AI 2027 Predicted
The scenario predicted that by early 2026, AI agents would score 1.3 on METR’s RE-Bench (Research Engineering Benchmark) — a suite of 7 challenging, open-ended ML research engineering tasks. A score of 1.3 means surpassing what top human experts achieve when given a full 8-hour session on well-defined research engineering tasks like implementing ML algorithms, debugging research code, and optimizing training pipelines. This prediction sits in the “Coding Automation” section, representing the point where AI begins to meaningfully accelerate AI R&D itself.
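Because the interpretation of scores like 0.5-0.8 versus 1.3 depends on how the benchmark is normalized, here is a minimal sketch of how an RE-Bench-style score could be computed, assuming each task's raw metric is rescaled so the unmodified starting solution maps to 0 and the human-expert 8-hour reference maps to 1, with the overall score averaged across tasks. The function names and numbers are hypothetical; METR's actual normalization and aggregation are defined in its RE-Bench publication.

```python
# Illustrative sketch of an RE-Bench-style normalized score, assuming each
# task's raw metric is rescaled so the unmodified starting solution maps to
# 0.0 and the human-expert 8-hour reference maps to 1.0, then averaged
# across tasks. METR's actual normalization and aggregation may differ.

def normalize(raw: float, start: float, expert_8h: float) -> float:
    """Rescale a raw task metric onto the 0 (start) .. 1 (expert) scale."""
    return (raw - start) / (expert_8h - start)

def aggregate_score(task_results: list[dict]) -> float:
    """Average the normalized scores across tasks (RE-Bench V1 has 7)."""
    return sum(
        normalize(t["raw"], t["start"], t["expert_8h"]) for t in task_results
    ) / len(task_results)

# Hypothetical results on three tasks: one slightly above the expert
# reference, two below. The aggregate lands near the 0.5-0.8 range that
# frontier models were scoring at the 8-hour reference.
example = [
    {"raw": 0.92, "start": 0.50, "expert_8h": 0.90},
    {"raw": 0.70, "start": 0.50, "expert_8h": 0.90},
    {"raw": 0.55, "start": 0.50, "expert_8h": 0.90},
]
print(round(aggregate_score(example), 2))  # ~0.56
```

On this kind of scale, 1.0 corresponds to the expert reference and 1.3 to clearly surpassing it, which is why the prediction is demanding even for models that already beat humans on short time budgets.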
How We Track This
- METR’s official RE-Bench results and publications at metr.org/research
- Model-specific evaluations published by METR
- METR time horizon measurements (related metric, drawn partly from RE-Bench tasks)
- AI lab announcements referencing RE-Bench or research engineering capability
Current Evidence
The 1.3 target has not been reached as of April 2026:
- At a 2-hour time budget, the best AI agents achieve roughly 4x the score of human experts; however, this comparison is at a short time horizon, not the 8-hour reference
- At the full 8-hour human reference, frontier models (Claude 3.5 Sonnet, o1-preview in the original evaluation; more recent models since) score in the 0.5-0.8 range, well below 1.3
- METR’s preliminary evaluation of Claude 3.7 Sonnet showed “impressive AI R&D capabilities on a subset of RE-Bench” but no published score exceeding 1.0
- External forecasts project RE-Bench reaching 1.0 in 2027, suggesting 1.3 by early 2026 was too aggressive
- The related METR time horizon metric shows strong progress: Claude Opus 4.6 has a 50%-time-horizon of 14.5 hours on software tasks (up from ~5 hours for Opus 4.5), indicating rapid capability growth. RE-Bench, however, measures a specific type of research engineering skill rather than general software task completion (see the sketch of the time-horizon metric after this list)
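Since the time-horizon metric recurs throughout this tracker, a minimal sketch of how a 50% time horizon can be estimated may help: fit a logistic curve of task success against the log of each task's human-expert completion time, then solve for the task length at which predicted success crosses 50%. All task data and numbers below are made up, and METR's actual task suite, weighting, and fitting procedure may differ.

```python
# Sketch of a 50%-time-horizon estimate: fit a logistic curve of agent
# success against log2(task length), where task length is how long the task
# takes a human expert, then find the length at which P(success) = 0.5.
# The data points here are hypothetical, not METR's.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical per-task data: human-expert task length in minutes, and
# whether the agent completed the task successfully.
lengths_min = np.array([4, 8, 15, 30, 60, 120, 240, 480, 960])
success = np.array([1, 1, 1, 1, 1, 0, 1, 0, 0])

X = np.log2(lengths_min).reshape(-1, 1)
model = LogisticRegression().fit(X, success)

# P(success) = 0.5 where the linear term is zero:
# coef * log2(t) + intercept = 0
log2_t50 = -model.intercept_[0] / model.coef_[0][0]
print(f"estimated 50% time horizon: {2 ** log2_t50:.0f} minutes")
```

A 14.5-hour time horizon in this framing means the fitted 50% crossing point sits well past the 8-hour mark, but it is measured over a broad set of software tasks rather than RE-Bench's specific research engineering tasks.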
The prediction appears approximately 12-18 months too early. The trajectory suggests 1.3 could be reached in late 2027, which would align with the paper’s broader narrative about AI automating research engineering.
Counterevidence & Limitations
- RE-Bench V1 consists of only 7 tasks, making it a noisy benchmark — small changes in model capability can produce large score swings
- The benchmark may be nearing the limits of its discriminative power as models improve; METR is developing updated evaluation approaches
- The 8-hour human expert reference was established with 61 human experts across 71 attempts; this baseline may shift as the expert pool is refined
- Some research engineering capabilities may not be well-captured by RE-Bench’s specific tasks (e.g., novel architecture design, long-horizon experiment planning)
- The strong METR time horizon results suggest general autonomous capability is growing fast, even if RE-Bench-specific scores haven’t reached 1.3
What Would Change Our Assessment
- Upgrade to On Track: A frontier model scores above 1.0 on RE-Bench or an equivalent research engineering evaluation
- Upgrade to Confirmed: A score of 1.3+ is published on RE-Bench V1 or a comparable successor benchmark
- Downgrade to Behind: Models remain below 1.0 through the end of 2026 and time-horizon gains don't translate into RE-Bench scores
Update History
| Date | Update |
|---|---|
| 2025-05 | RE-Bench V1 baseline established (published late 2024 by METR). Claude 3.5 Sonnet and o1-preview score ~0.5-0.8 at 8-hour human-expert reference. Models outperform humans ~4x at 2-hour budget but plateau or regress on longer tasks. Target of 1.3 appears distant. |
| 2025-07 | METR publishes study on early-2025 AI developer productivity (July 10). Finding: experienced developers were 19% slower with AI tools in a controlled trial — contradicting self-reported 20% speedups. Raises questions about AI’s ability to accelerate sustained research work. |
| 2025-09 | METR evaluates Claude 3.7 Sonnet on RE-Bench (preliminary). Results described as “impressive” on a subset of tasks. Exact numeric score not widely published, but qualitative signal is positive. |
| 2025-12 | METR begins publishing a time-horizon metric measuring how long a task an AI can complete autonomously at human-expert level. Reframes the RE-Bench question from a single score to “how many hours of autonomous work can an AI sustain?” |
| 2026-01 | METR publishes Time Horizon 1.1 (Jan 29). Claude Opus 4.5 measured at ~4 hours 49 minutes — a substantial jump suggesting rapid progress toward the 8-hour RE-Bench reference level. External forecasters had predicted RE-Bench 1.0 in 2027; this pace suggests that was too conservative. |
| 2026-02 | Claude Opus 4.6 measured at ~14.5 hours time horizon — surpassing the 8-hour human-expert reference level. If this translates to RE-Bench V1 scoring, it implies agents can now match or exceed top experts on 8-hour research engineering tasks. METR announces redesign of developer productivity study (Feb 24), noting developers now refuse to work without AI, invalidating earlier “19% slower” finding. Status revised from Behind to On Track. Confidence 0.60. |
| 2026-03 | The 3x improvement from Opus 4.5 (~4h49m) to Opus 4.6 (~14.5h) in weeks suggests the 1.3 score may be within reach. Key uncertainty: whether time-horizon metric and RE-Bench V1 numeric scores map cleanly onto each other. The spirit of the prediction — AI matching top human experts on 8-hour tasks — appears approximately vindicated by time-horizon data. |