Model Capability
Benchmarks, model performance, and signs of frontier capability progress.
This category contains 7 tracked predictions. Each page includes the original claim, current evidence, counterevidence, and what would change our assessment.
Agent-2 is trained with 2×10²⁸ FLOP (Apr 2026 – Mar 2027), a 1000× increase over GPT-4-scale training compute.
OpenBrain's latest public model—Agent-0—was trained with 10²⁷ FLOP.
Agent-2 is effectively 'online learning,' in that it's built to never really finish training. Every day, the weights get updated to the latest version, trained on more data generated by the previous version the previous day.
One such breakthrough is augmenting the AI's text-based scratchpad (chain of thought) with a higher-bandwidth thought process (neuralese recurrence and memory).
Now, the models have become sufficiently good at verifying more subjective things, allowing the use of iterated distillation and amplification (IDA) to improve the model at many tasks.
By this point 'finishes training' is a bit of a misnomer; models are frequently updated to newer versions trained on additional data or partially re-trained to patch some weaknesses.
Specifically, we forecast that they score 65% on the OSWorld benchmark of basic computer tasks (compared to 38% for Operator and 70% for a typical skilled non-expert human).
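As a rough cross-check of the numbers quoted in the claims above, the following sketch works through the arithmetic they imply. It is a minimal illustration, not part of the original forecasts: the GPT-4 baseline is not stated on this page, so it is derived here from the "1000×" claim, and all variable names are hypothetical.

```python
# Training-compute figures quoted in the claims above (in FLOP).
AGENT_2_FLOP = 2 * 10**28
AGENT_0_FLOP = 10**27
CLAIMED_MULTIPLE_OVER_GPT4 = 1000

# Assumption: the GPT-4 baseline is whatever the "1000x" claim implies;
# the page itself does not state a GPT-4 figure.
implied_gpt4_flop = AGENT_2_FLOP // CLAIMED_MULTIPLE_OVER_GPT4

# Ratios between the quoted compute budgets.
agent0_over_gpt4 = AGENT_0_FLOP / implied_gpt4_flop
agent2_over_agent0 = AGENT_2_FLOP / AGENT_0_FLOP

# OSWorld scores quoted in the last claim (percent of tasks solved).
FORECAST_SCORE = 65
OPERATOR_SCORE = 38
HUMAN_SCORE = 70

# Gaps in percentage points.
gain_over_operator = FORECAST_SCORE - OPERATOR_SCORE
gap_to_human = HUMAN_SCORE - FORECAST_SCORE

print(f"Implied GPT-4 baseline: {implied_gpt4_flop:.0e} FLOP")
print(f"Agent-0 vs implied GPT-4 baseline: {agent0_over_gpt4:.0f}x")
print(f"Agent-2 vs Agent-0: {agent2_over_agent0:.0f}x")
print(f"Forecast vs Operator on OSWorld: +{gain_over_operator} points")
print(f"Forecast vs skilled non-expert human: -{gap_to_human} points")
```

Under these assumptions, the quoted figures are internally consistent: the implied baseline is 2×10²⁵ FLOP, which places Agent-0 at 50× that baseline and Agent-2 at 20× Agent-0, while the OSWorld forecast sits 27 points above Operator and 5 points below the quoted human reference.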