Agents struggle with long-horizon tasks
AI 2027 describes Agent-1 as bad at even simple long-horizon tasks (page 7, Early 2026 section). It also says that agents in Mid 2025 are 'impressive in theory but in practice unreliable.'
At a glance
- Assessment: Confirmed
- Confidence: 85%
- Predicted timing: 2025
- Primary source: ai-2027.com, pages 3 (Mid 2025) and 7 (Early 2026)
What AI 2027 Predicted
The scenario predicts that agents become useful for short tasks but struggle with sustained, long-horizon work: tasks that require hours or days of coherent effort. It frames this as a temporary limitation that eases rapidly.
How We Track This
We monitor:
- METR time horizon benchmarks (50% and 80% thresholds)
- Reports on agent task completion rates for multi-hour workflows
- Enterprise adoption patterns for long-running agent tasks
- Academic research on agent planning and coherence over time
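For readers unfamiliar with the 50% and 80% thresholds, the following is a minimal sketch of how a time horizon can be read off a fitted success curve. It assumes a logistic model of success against the base-2 log of task length, which is our simplified reading of METR's approach. The function names and the coefficients are hypothetical, chosen only for illustration, and are not METR's fitted values.

```python
import math


def success_probability(task_minutes: float, intercept: float, slope: float) -> float:
    """Modeled chance of success; it falls as log2(task length) grows."""
    z = intercept - slope * math.log2(task_minutes)
    return 1.0 / (1.0 + math.exp(-z))


def time_horizon(threshold: float, intercept: float, slope: float) -> float:
    """Task length in minutes at which modeled success equals `threshold`."""
    logit = math.log(threshold / (1.0 - threshold))
    return 2.0 ** ((intercept - logit) / slope)


# Hypothetical coefficients for illustration only (not fitted to real data).
INTERCEPT, SLOPE = 4.0, 0.6

h50 = time_horizon(0.5, INTERCEPT, SLOPE)
h80 = time_horizon(0.8, INTERCEPT, SLOPE)
print(f"50% horizon: {h50:.0f} min, 80% horizon: {h80:.0f} min")
```

Under any such curve with a positive slope, the 80% horizon is shorter than the 50% horizon for the same model, so figures quoted at different thresholds should not be compared directly.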
Current Evidence
This prediction was accurate for 2025 but is rapidly becoming outdated. METR’s 50% time horizon for Claude Opus 4.6 is roughly 14.5 hours, longer than a full work day, and the horizon has been doubling about every 123 days. Ajeya Cotra notes the 50% time horizon may reach 20 hours by the end of 2026. Multi-day projects still fail, but the horizon is expanding fast.
Sources:
- The State of AI Agents in 2026 — Metavert
- I underestimated AI capabilities (again) — Ajeya Cotra
- METR: Measuring AI Ability to Complete Long Tasks (arXiv)
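The doubling-time figure quoted above implies simple arithmetic for when a given horizon would be reached. The sketch below takes the 14.5-hour and 123-day figures from the text and assumes the doubling rate stays constant, which is a strong assumption and not a forecast. It also assumes the starting value and the targets refer to the same success threshold. The helper names are hypothetical.

```python
import math


def projected_horizon(current_hours: float, days_ahead: float, doubling_days: float) -> float:
    """Horizon after `days_ahead` days under constant exponential doubling."""
    return current_hours * 2.0 ** (days_ahead / doubling_days)


def days_to_reach(current_hours: float, target_hours: float, doubling_days: float) -> float:
    """Days until the horizon hits `target_hours` under the same assumption."""
    return doubling_days * math.log2(target_hours / current_hours)


CURRENT_HOURS = 14.5   # horizon estimate quoted in the text
DOUBLING_DAYS = 123.0  # doubling time quoted in the text

# Targets: the 20-hour figure cited in the text and a 24-hour threshold.
for target in (20.0, 24.0):
    days = days_to_reach(CURRENT_HOURS, target, DOUBLING_DAYS)
    print(f"{target:.0f} h reached after about {days:.0f} days")
```

Because the projection is exponential, small changes in the assumed doubling time move the crossing date considerably, which is one reason we track the measured trend instead of relying on extrapolation.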
Counterevidence & Limitations
- The METR benchmark tasks may not reflect real-world complexity — they are synthetic and designed for measurability rather than ecological validity
- Improvements in time horizon don’t necessarily translate to reliability on diverse, open-ended tasks — agents may improve on structured tasks while still failing on ambiguous real-world problems
- Some companies report agent failures on production workloads even when benchmarks improve, suggesting a gap between controlled and deployed performance
- The prediction is becoming harder to score as the situation evolves: the 2025 claim is confirmed, but the rapid improvement trajectory means the “struggle” characterization has a short shelf life
- There is no standardized definition of “long-horizon” — different benchmarks use different task lengths, making cross-study comparison difficult
What Would Change Our Assessment
- Historical note: This prediction specifically targeted 2025 and has been confirmed. By late 2026, agents may handle multi-day tasks reliably.
- Watch for: METR 50% time horizon exceeding 24 hours
Update History
| Date | Update |
|---|---|
| 2026-03 | Prediction confirmed for 2025 timeframe. Agents still struggle with multi-day autonomous work, though rapid improvement is visible in early 2026. |
| 2026-01 | METR Time Horizon 1.1: Claude Opus 4.5 at ~4h49m (Jan 2026). Agents now sustain coherent work for nearly 5 hours — well beyond the “simple long-horizon tasks” the prediction describes. The “struggle” characterization from mid-2025 is increasingly outdated for frontier models, though it remains accurate for most publicly available agents. |
| 2025-09 | Scale AI SWE-Bench Pro (Sep 19): models scoring 70%+ on standard SWE-bench drop to ~23% on long-horizon tasks involving multi-file refactors and cross-repository changes. Long-horizon capability gap now formally measured. |
| 2025-07 | METR domain time-horizon analysis (July 14): 50% success horizon for frontier models at ~50 minutes on human expert tasks. Tasks requiring hours of sustained autonomous work remain beyond current capability. |
| 2025-06 | SWE-bench and GAIA benchmarks confirm agents fail on tasks requiring sustained multi-hour work. Error accumulation over long horizons remains the primary failure mode. |