Agents struggle with long-horizon tasks

Author Johannes Haus

Last updated 2026-06-06

Confirmed · Agent Autonomy · 85% confidence

Predicted: 2025 · Updated: 2026-06-06 · Source: ai-2027.com, pages 3 (Mid 2025) and 7 (Early 2026)

AI 2027 describes Agent-1 as bad at even simple long-horizon tasks (page 7, Early 2026 section) and describes Mid 2025 agents as 'impressive in theory but in practice unreliable' (page 3, Mid 2025 section).

At a glance

Assessment: Confirmed
Confidence: 85%
Predicted timing: 2025
Primary source: ai-2027.com, pages 3 (Mid 2025) and 7 (Early 2026)

What AI 2027 Predicted

The scenario predicts that while agents become useful for short tasks, they struggle with sustained, long-horizon work — tasks that require hours or days of coherent effort. This is framed as a temporary limitation that improves rapidly.

How We Track This

We monitor:

METR time horizon benchmarks (50% and 80% thresholds)
Reports on agent task completion rates for multi-hour workflows
Enterprise adoption patterns for long-running agent tasks
Academic research on agent planning and coherence over time

Current Evidence

This prediction was accurate for 2025, but the frontier has moved quickly since then. METR’s current Time Horizon 1.1 data estimates Claude Opus 4.6 at a 50% time horizon of roughly 11h59m and an 80% time horizon of roughly 1h10m. That is not reliable full-work-day autonomy, but it does show that structured software/ML/cybersecurity tasks requiring many human-hours are increasingly within reach at lower reliability. Multi-day open-ended projects still fail often, and METR cautions that time horizon is a task-difficulty measure rather than the amount of time an AI can act autonomously.
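To make the gap between the 50% and 80% figures concrete, here is a minimal sketch of how the two horizons relate if success probability is modelled as a logistic function of the logarithm of human task length. That functional form is an assumption made for illustration, and the function names (logistic_slope, success_probability) are hypothetical; this is not METR's code, and the only real data it uses are the two horizon values quoted above.

```python
import math


def logistic_slope(h50_min: float, h80_min: float) -> float:
    """Slope (per doubling of task length) implied by the two horizons.

    Assumes p(t) = sigmoid(beta * (log2(h50) - log2(t))), so that
    p(h50) = 0.5 by construction and p(h80) = 0.8 fixes beta.
    """
    if not 0 < h80_min < h50_min:
        raise ValueError("expected 0 < 80% horizon < 50% horizon")
    # logit(0.8) = ln(0.8 / 0.2) = ln(4)
    return math.log(4.0) / math.log2(h50_min / h80_min)


def success_probability(task_min: float, h50_min: float, h80_min: float) -> float:
    """Modelled success rate on a task that takes a human task_min minutes."""
    beta = logistic_slope(h50_min, h80_min)
    x = beta * (math.log2(h50_min) - math.log2(task_min))
    return 1.0 / (1.0 + math.exp(-x))


# Horizon values quoted in the text, converted to minutes.
H50 = 11 * 60 + 59  # 11h59m
H80 = 1 * 60 + 10   # 1h10m

for task in (30, 70, 240, 480, 719, 1440):
    p = success_probability(task, H50, H80)
    print(f"{task:>5} min human task -> modelled success {p:.0%}")
```

Under these assumptions the curve is shallow: a roughly tenfold spread between the two horizons means modelled success falls slowly as tasks lengthen, which is why a near-12-hour 50% horizon can coexist with a 1h10m 80% horizon. The output describes task difficulty, not how long an agent can run unattended.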

Counterevidence & Limitations

The METR benchmark tasks may not reflect real-world complexity — they are synthetic and designed for measurability rather than ecological validity
Improvements in time horizon don’t necessarily translate to reliability on diverse, open-ended tasks — agents may improve on structured tasks while still failing on ambiguous real-world problems
Some companies report agent failures on production workloads even when benchmarks improve, suggesting a gap between controlled and deployed performance
The prediction is becoming harder to score as the situation evolves: the 2025 claim is confirmed, but the rapid improvement trajectory means the “struggle” characterization has a short shelf life
There is no standardized definition of “long-horizon” — different benchmarks use different task lengths, making cross-study comparison difficult

What Would Change Our Assessment

Historical note: This prediction specifically targeted 2025 and has been confirmed. By late 2026, agents may handle multi-day tasks reliably.
Watch for: METR 50% time horizon exceeding 24 hours
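As a rough aid for the watch threshold, the sketch below converts the duration strings used on this page into minutes and counts how many doublings separate a current horizon from a target. The helper names (parse_duration, doublings_needed) are hypothetical, and the calculation says nothing about how long a doubling takes.

```python
import math
import re


def parse_duration(text: str) -> int:
    """Convert strings such as '11h59m', '24h' or '50m' to minutes."""
    match = re.fullmatch(r"(?:(\d+)h)?(?:(\d+)m)?", text.strip())
    if match is None or (match.group(1) is None and match.group(2) is None):
        raise ValueError(f"unrecognised duration: {text!r}")
    hours = int(match.group(1) or 0)
    minutes = int(match.group(2) or 0)
    return hours * 60 + minutes


def doublings_needed(current_min: float, target_min: float) -> float:
    """Number of doublings of the horizon required to reach the target."""
    return math.log2(target_min / current_min)


current = parse_duration("11h59m")  # 50% horizon quoted in Current Evidence
target = parse_duration("24h")      # watch threshold
print(f"{doublings_needed(current, target):.2f} doublings to reach 24h")
```

With the figures quoted on this page, the 24-hour threshold sits about one doubling above the current 50% horizon.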

Update History

Date	Update
2026-06-06	Corrected the Opus 4.6 METR framing: current TH1.1 raw data estimates roughly 11h59m at the 50% horizon and roughly 1h10m at the 80% horizon, not 14.5h at 80% reliability. Status unchanged because the 2025-era “long-horizon struggle” claim remains historically confirmed, while newer capability gains are tracked as a rapidly changing limitation.
2026-03	Prediction confirmed for 2025 timeframe. Agents still struggle with multi-day autonomous work, though rapid improvement visible in early 2026.
2026-01	METR Time Horizon 1.1: Claude Opus 4.5 at ~4h49m (Jan 2026). Frontier agents now handle tasks that take human experts nearly 5 hours, well beyond the “simple long-horizon tasks” the prediction describes, although time horizon measures task difficulty rather than how long an agent can run autonomously. The “struggle” characterization from mid-2025 is increasingly outdated for frontier models, though it remains accurate for most publicly available agents.
2025-09	Scale AI SWE-Bench Pro (Sep 19): models scoring 70%+ on standard SWE-bench drop to ~23% on long-horizon tasks involving multi-file refactors and cross-repository changes. Long-horizon capability gap now formally measured.
2025-07	METR domain time-horizon analysis (July 14): 50% success horizon for frontier models at ~50 minutes on human expert tasks. Tasks requiring hours of sustained autonomous work remain beyond current capability.
2025-06	SWE-bench and GAIA benchmarks confirm agents fail on tasks requiring sustained multi-hour work. Error accumulation over long horizons remains the primary failure mode.