Agents struggle with long-horizon tasks

Confirmed · Agent Autonomy · 85% confidence

Predicted: 2025 · Updated: 2026-04-02 · Source: ai-2027.com, pages 3 (Mid 2025) and 7 (Early 2026)

AI 2027 states that Agent-1 is bad at even simple long-horizon tasks (page 7, Early 2026 section). It also describes agents in Mid 2025 as 'impressive in theory but in practice unreliable' (page 3).

What AI 2027 Predicted

The scenario predicts that while agents become useful for short tasks, they struggle with sustained, long-horizon work — tasks that require hours or days of coherent effort. This is framed as a temporary limitation that improves rapidly.

How We Track This

We monitor:

METR time horizon benchmarks (50% and 80% thresholds)
Reports on agent task completion rates for multi-hour workflows
Enterprise adoption patterns for long-running agent tasks
Academic research on agent planning and coherence over time

Current Evidence

This prediction was accurate for 2025 but is rapidly becoming outdated. METR’s 80% time horizon for Claude Opus 4.6 is 14.5 hours, longer than a full work-day, and the horizon has been doubling roughly every 123 days. Ajeya Cotra notes the 50% time horizon may reach 20 hours by end of 2026. Multi-day projects still fail, but the horizon is expanding fast.
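
The doubling figure quoted above implies a simple exponential extrapolation. The Python sketch below shows that arithmetic, assuming the horizon keeps doubling at a fixed interval. The function name `project_horizon` is illustrative and not taken from METR, and real progress need not follow a clean exponential.

```python
def project_horizon(current_hours: float, doubling_days: float, days_ahead: float) -> float:
    """Extrapolate a time horizon, assuming it doubles every `doubling_days` days.

    This is a constant-doubling-time assumption, not a forecast method
    described by the source.
    """
    if current_hours <= 0 or doubling_days <= 0:
        raise ValueError("current_hours and doubling_days must be positive")
    return current_hours * 2 ** (days_ahead / doubling_days)


# Figures quoted in the text: 14.5 hours, doubling roughly every 123 days.
one_doubling_later = project_horizon(14.5, 123, 123)
print(f"{one_doubling_later:.1f} hours")  # 29.0 hours
```

One doubling period on, the extrapolated horizon is 29 hours, which is still short of the multi-day projects the text says continue to fail.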

Counterevidence & Limitations

The METR benchmark tasks may not reflect real-world complexity — they are synthetic and designed for measurability rather than ecological validity
Improvements in time horizon don’t necessarily translate to reliability on diverse, open-ended tasks — agents may improve on structured tasks while still failing on ambiguous real-world problems
Some companies report agent failures on production workloads even when benchmarks improve, suggesting a gap between controlled and deployed performance
The prediction is becoming harder to score as the situation evolves: the 2025 claim is confirmed, but the rapid improvement trajectory means the “struggle” characterization has a short shelf life
There is no standardized definition of “long-horizon” — different benchmarks use different task lengths, making cross-study comparison difficult

What Would Change Our Assessment

Historical note: This prediction specifically targeted 2025 and has been confirmed. By late 2026, agents may handle multi-day tasks reliably.
Watch for: METR 50% time horizon exceeding 24 hours
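
To gauge how far away the watch condition might be, a doubling-time assumption can be inverted to estimate the days until a threshold is crossed. This is a rough Python sketch: `days_until_threshold` is a hypothetical helper, and the inputs are assumptions for illustration. The starting value is the ~4h49m figure from the 2026-01 entry in the Update History. The 123-day doubling time is quoted in Current Evidence for the 80% horizon, so applying it to the 50% horizon is itself an assumption.

```python
import math


def days_until_threshold(current_hours: float, threshold_hours: float, doubling_days: float) -> float:
    """Days until the horizon reaches `threshold_hours`, assuming a fixed doubling time."""
    if current_hours <= 0 or threshold_hours <= 0 or doubling_days <= 0:
        raise ValueError("all arguments must be positive")
    if current_hours >= threshold_hours:
        return 0.0  # threshold already met
    # Solve current * 2**(t / doubling_days) = threshold for t.
    return doubling_days * math.log2(threshold_hours / current_hours)


# Illustrative inputs only (see the caveats in the paragraph above).
current = 4 + 49 / 60  # ~4h49m, from the 2026-01 update entry
print(round(days_until_threshold(current, 24.0, 123.0)), "days")
```

The result is sensitive to both the starting value and the doubling time, so it is best read as an order-of-magnitude check rather than a date.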

Update History

Date	Update
2025-06	SWE-bench and GAIA benchmarks confirm agents fail on tasks requiring sustained multi-hour work. Error accumulation over long horizons remains the primary failure mode.
2025-07	METR domain time-horizon analysis (July 14): 50% success horizon for frontier models at ~50 minutes on human expert tasks. Tasks requiring hours of sustained autonomous work remain beyond current capability.
2025-09	Scale AI SWE-Bench Pro (Sep 19): models scoring 70%+ on standard SWE-bench drop to ~23% on long-horizon tasks involving multi-file refactors and cross-repository changes. Long-horizon capability gap now formally measured.
2026-01	METR Time Horizon 1.1: Claude Opus 4.5 at ~4h49m (Jan 2026). Agents now sustain coherent work for nearly 5 hours — well beyond the “simple long-horizon tasks” the prediction describes. The “struggle” characterization from mid-2025 is increasingly outdated for frontier models, though it remains accurate for most publicly available agents.
2026-03	Prediction confirmed for 2025 timeframe. Agents still struggle with multi-day autonomous work, though rapid improvement visible in early 2026.