Agents struggle with long-horizon tasks

Author Johannes Haus
Confirmed · Agent Autonomy · 85% confidence
Predicted: 2025 · Updated: 2026-03-13 · Source: ai-2027.com, pages 3 (Mid 2025) and 7 (Early 2026)
AI 2027 describes Agent-1 as bad at even simple long-horizon tasks (page 7, Early 2026 section) and says agents in mid-2025 are 'impressive in theory but in practice unreliable' (page 3, Mid 2025 section).

At a glance

  • Assessment: Confirmed
  • Confidence: 85%
  • Predicted timing: 2025
  • Primary source: ai-2027.com, pages 3 (Mid 2025) and 7 (Early 2026)

What AI 2027 Predicted

The scenario predicts that while agents become useful for short tasks, they struggle with sustained, long-horizon work — tasks that require hours or days of coherent effort. This is framed as a temporary limitation that improves rapidly.

How We Track This

We monitor:

  • METR time horizon benchmarks (50% and 80% thresholds)
  • Reports on agent task completion rates for multi-hour workflows
  • Enterprise adoption patterns for long-running agent tasks
  • Academic research on agent planning and coherence over time
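
To make the two thresholds concrete, here is a minimal sketch of how a time horizon can be read off a fitted success curve. It assumes success probability follows a logistic curve in the log of task length, which is our simplified reading of how such horizons are derived and not METR's actual code. The function names and the parameter values (intercept 6.0, slope 1.0) are hypothetical, chosen only for illustration.

```python
import math


def logit(p: float) -> float:
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def time_horizon_minutes(intercept: float, slope: float, success_rate: float) -> float:
    """Task length (in minutes) at which modelled success equals success_rate.

    Assumed model (an illustration, not a published specification):
        P(success) = sigmoid(intercept - slope * log2(task_minutes))
    Solving for task_minutes gives 2 ** ((intercept - logit(p)) / slope).
    """
    if not 0.0 < success_rate < 1.0:
        raise ValueError("success_rate must be strictly between 0 and 1")
    if slope <= 0.0:
        raise ValueError("slope must be positive so success falls with task length")
    return 2.0 ** ((intercept - logit(success_rate)) / slope)


# Hypothetical fitted parameters, not taken from any real model.
h50 = time_horizon_minutes(intercept=6.0, slope=1.0, success_rate=0.5)
h80 = time_horizon_minutes(intercept=6.0, slope=1.0, success_rate=0.8)
print(f"50% horizon: {h50:.1f} min, 80% horizon: {h80:.1f} min")
```

Under this model the 80% horizon is always shorter than the 50% horizon for the same agent, which is why the two thresholds are tracked separately.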

Current Evidence

This prediction was accurate for 2025 but is rapidly becoming outdated. On METR’s 80% time horizon, Claude Opus 4.6 reached 14.5 hours, longer than a full work day, and the horizon has been doubling roughly every 123 days. Ajeya Cotra notes the 50% time horizon may reach 20 hours by the end of 2026. Multi-day projects still fail, but the horizon is expanding fast.
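
The doubling arithmetic can be sketched as follows. This is a rough extrapolation that assumes the horizon keeps growing at a constant exponential rate, which is an assumption and not a forecast. The function names are hypothetical, and the inputs are the 14.5-hour and 123-day figures quoted above.

```python
import math


def days_to_reach(current_hours: float, target_hours: float, doubling_days: float) -> float:
    """Days until the horizon hits target_hours under steady exponential growth."""
    if current_hours <= 0 or target_hours <= 0 or doubling_days <= 0:
        raise ValueError("all inputs must be positive")
    # Number of doublings needed, times the length of one doubling.
    return doubling_days * math.log2(target_hours / current_hours)


def projected_hours(current_hours: float, days_ahead: float, doubling_days: float) -> float:
    """Horizon after days_ahead days, assuming the same constant doubling time."""
    return current_hours * 2.0 ** (days_ahead / doubling_days)


# Figures quoted in the text: 14.5 hours today, doubling about every 123 days.
days_to_24h = days_to_reach(current_hours=14.5, target_hours=24.0, doubling_days=123.0)
print(f"Days to a 24-hour horizon: {days_to_24h:.0f}")
print(f"Horizon after two doublings: {projected_hours(14.5, 246.0, 123.0):.1f} hours")
```

The result is only as good as the constant-growth assumption, and it applies to whichever threshold the starting figure was measured at. A horizon at one threshold cannot be projected onto another without the underlying success curve.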

Counterevidence & Limitations

  • The METR benchmark tasks may not reflect real-world complexity — they are synthetic and designed for measurability rather than ecological validity
  • Improvements in time horizon don’t necessarily translate to reliability on diverse, open-ended tasks — agents may improve on structured tasks while still failing on ambiguous real-world problems
  • Some companies report agent failures on production workloads even when benchmarks improve, suggesting a gap between controlled and deployed performance
  • The prediction is becoming harder to score as the situation evolves: the 2025 claim is confirmed, but the rapid improvement trajectory means the “struggle” characterization has a short shelf life
  • There is no standardized definition of “long-horizon” — different benchmarks use different task lengths, making cross-study comparison difficult

What Would Change Our Assessment

  • Historical note: This prediction specifically targeted 2025 and has been confirmed. By late 2026, agents may handle multi-day tasks reliably.
  • Watch for: METR 50% time horizon exceeding 24 hours

Update History

Date · Update
2026-03 · Prediction confirmed for 2025 timeframe. Agents still struggle with multi-day autonomous work, though rapid improvement visible in early 2026.
2026-01 · METR Time Horizon 1.1: Claude Opus 4.5 at ~4h49m (Jan 2026). Agents now sustain coherent work for nearly 5 hours, well beyond the “simple long-horizon tasks” the prediction describes. The “struggle” characterization from mid-2025 is increasingly outdated for frontier models, though it remains accurate for most publicly available agents.
2025-09 · Scale AI SWE-Bench Pro (Sep 19): models scoring 70%+ on standard SWE-bench drop to ~23% on long-horizon tasks involving multi-file refactors and cross-repository changes. Long-horizon capability gap now formally measured.
2025-07 · METR domain time-horizon analysis (July 14): 50% success horizon for frontier models at ~50 minutes on human expert tasks. Tasks requiring hours of sustained autonomous work remain beyond current capability.
2025-06 · SWE-bench and GAIA benchmarks confirm agents fail on tasks requiring sustained multi-hour work. Error accumulation over long horizons remains the primary failure mode.