Model Capability Predictions

This category contains 7 tracked predictions. Each page includes the original claim, current evidence, counterevidence, and what would change our assessment.

10²⁸ FLOP training run completed Not Yet Testable

March 2027 Updated 2026-03-13 50% confidence

Agent-2 trained at 2×10²⁸ FLOP (Apr 2026 – Mar 2027), representing a 1000× increase over GPT-4 scale.
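The 1000× figure can be sanity-checked with a short calculation. The sketch below assumes GPT-4's training compute was roughly 2×10²⁵ FLOP, a commonly cited estimate that this page does not itself state. The variable names are illustrative only.

```python
# Assumption (not stated on this page): GPT-4 was trained with ~2e25 FLOP.
GPT4_FLOP = 2 * 10**25

# Figure quoted in the claim above for Agent-2.
AGENT2_FLOP = 2 * 10**28

# Integer arithmetic keeps the ratio exact.
ratio = AGENT2_FLOP // GPT4_FLOP
print(f"Agent-2 / GPT-4 training compute: {ratio}x")
```

Under that assumption the ratio is exactly three orders of magnitude. A different GPT-4 estimate would shift the multiplier proportionally.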

Frontier model trained at 10²⁷ FLOP (Agent-0, completes May 2025) Emerging

May 2025 Updated 2026-03-13 45% confidence

OpenBrain's latest public model—Agent-0—was trained with 10²⁷ FLOP.

Frontier model uses daily weight updates / online learning Not Yet Testable

January 2027 Updated 2026-03-13 20% confidence

Agent-2 is effectively 'online learning,' in that it's built to never really finish training. Every day, the weights get updated to the latest version, trained on more data generated by the previous version the previous day.
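The daily cycle described in the claim can be pictured as a simple loop in which each day's model generates the data used to produce the next day's weights. The following is only a toy sketch of that control flow: `ModelVersion`, `generate_data`, and `update_weights` are hypothetical names invented for illustration, and no real training takes place.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelVersion:
    """Hypothetical stand-in for a model checkpoint."""
    day: int
    examples_seen: int


def generate_data(model: ModelVersion, n: int = 100) -> list[str]:
    # Placeholder: the current version produces n synthetic samples.
    return [f"day{model.day}-sample{i}" for i in range(n)]


def update_weights(model: ModelVersion, data: list[str]) -> ModelVersion:
    # Placeholder: "training" just records how much data was consumed.
    return ModelVersion(day=model.day + 1,
                        examples_seen=model.examples_seen + len(data))


model = ModelVersion(day=0, examples_seen=0)
for _ in range(3):  # three simulated days
    data = generate_data(model)          # yesterday's version generates data
    model = update_weights(model, data)  # today's version is trained on it

print(model)
```

The point of the sketch is that the loop never terminates in a final checkpoint; there is always a next version that depends on the previous one's output.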

High-bandwidth non-text reasoning (neuralese) deployed Emerging

Early 2027 Updated 2026-03-13 35% confidence

One such breakthrough is augmenting the AI's text-based scratchpad (chain of thought) with a higher-bandwidth thought process (neuralese recurrence and memory).

Iterated distillation and amplification (IDA) achieves superhuman performance at coding Emerging

Early 2027 Updated 2026-03-13 40% confidence

Now, the models have become sufficiently good at verifying more subjective things, allowing the use of IDA to improve the model at many tasks.

Models shift to continuous/iterative training Confirmed

Late 2025 Updated 2026-03-13 90% confidence

By this point 'finishes training' is a bit of a misnomer; models are frequently updated to newer versions trained on additional data or partially re-trained to patch some weaknesses.

OSWorld benchmark reaches 65% by mid-2025 Confirmed

Mid 2025 (65%), Early 2026 (80%) Updated 2026-03-13 90% confidence

Specifically, we forecast that they score 65% on the OSWorld benchmark of basic computer tasks (compared to 38% for Operator and 70% for a typical skilled non-expert human).