OSWorld benchmark reaches 65% by mid-2025
“Specifically, we forecast that they score 65% on the OSWorld benchmark of basic computer tasks (compared to 38% for Operator and 70% for a typical skilled non-expert human).”
What AI 2027 Predicted
The scenario forecasted a rapid progression in computer-using AI agents, measured against the OSWorld benchmark — a suite of 369 real computer tasks involving web browsing, desktop applications, file management, and multi-app workflows in actual OS environments (Ubuntu, Windows, macOS). The authors predicted agents would reach 65% by mid-2025 (approaching but not matching a typical skilled non-expert human at 70%) and 80% by early 2026 (matching or exceeding that human baseline).
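To make the percentage targets concrete, the arithmetic below translates them into task counts on the 369-task suite. This is a rough sketch that assumes simple pass/fail scoring across the full suite; any partial-credit or subset evaluation would shift the numbers slightly.

```python
import math

# Translate OSWorld percentage targets into minimum task counts,
# assuming the full 369-task suite and binary pass/fail scoring.
TOTAL_TASKS = 369

def tasks_for(score_pct: float, total: int = TOTAL_TASKS) -> int:
    """Minimum whole tasks an agent must pass to reach a given score."""
    return math.ceil(score_pct / 100 * total)

for label, pct in [("Operator baseline", 38), ("mid-2025 target", 65),
                   ("human baseline", 70), ("early-2026 target", 80)]:
    print(f"{label}: {pct}% = {tasks_for(pct)} of {TOTAL_TASKS} tasks")
```

On this reading, the jump from the 65% target to the 80% target is about 56 additional tasks, most of them presumably from the benchmark's harder tail.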
A second, higher target appears in the Early 2026 section: “Specifically, we predict a score of 80% on OSWorld (equivalent to a skilled but non-expert human).” At the time of writing (April 2025), OpenAI’s Operator scored 38%, setting the baseline for the prediction.
How We Track This
- OSWorld official leaderboard at os-world.github.io
- OSWorld-Verified benchmark results (more rigorous evaluation subset)
- Model release announcements from Anthropic, OpenAI, Google DeepMind with benchmark scores
- XLANG Lab (HKU) publications and leaderboard updates
Current Evidence
Mid-2025 target (65%): Exceeded, roughly five months late
The 65% threshold was crossed in November 2025 rather than mid-2025. By early 2026, multiple frontier models had surpassed it:
- Claude Opus 4.6 (Feb 2026): 72.7% on OSWorld — the leader on the standard benchmark, up from 66.3% for Opus 4.5
- CoACT-1: 60.76% (agentic framework approach)
- Agentic frameworks generally score 45-61%, advanced foundation models 35-44%
The average score across all evaluated models is 41.4%, showing that while frontier models have crossed the threshold, it remains challenging for most systems.
Early 2026 target (80%): Not yet reached
No model has publicly achieved 80% on OSWorld as of April 2026. The highest published scores, Claude Opus 4.6’s 72.7% on OSWorld and GPT-5.4’s 75.0% on OSWorld-Verified, roughly match or slightly exceed the measured human baseline (~72.4%) but fall short of the 80% absolute target. The remaining gap represents the harder tail of tasks: complex multi-app workflows, long-horizon OS operations, and tasks requiring nuanced visual understanding.
UiPath Screen Agent (powered by Claude Opus 4.5) achieved #1 on OSWorld-Verified (a stricter subset) in January 2026, demonstrating that specialized agent scaffolding can push scores higher.
Counterevidence & Limitations
- OSWorld scores are sensitive to agent scaffolding and prompting strategies — raw model capability vs. engineered system scores can differ significantly
- The benchmark was introduced in 2024; task difficulty calibration may shift as the community identifies easy vs. hard subsets
- The 70% human baseline was measured for “skilled non-expert” users; since the 80% target exceeds that baseline, and expert users likely score higher still, it is a more demanding bar than “equivalent to a skilled non-expert human” suggests
- OSWorld-Verified (a stricter evaluation) produces lower scores than the standard benchmark
What Would Change Our Assessment
- Upgrade to Ahead: A model exceeds 80% before mid-2026, beating the Early 2026 timeline
- Downgrade to Behind (for 80% target): No model reaches 80% by end of 2026, suggesting the Early 2026 prediction was too aggressive
- Reinterpretation: If OSWorld methodology changes substantially (new tasks, re-scored baselines), the target numbers may need recalibration
Update History
| Date | Update |
|---|---|
| 2025-05 | Prediction begins tracking. Best production agent: OpenAI CUA at 38.1%. Research agents (PC Agent-E at 14.9%, ARPO at 29.9%) show progress but 65% mid-2025 target looks unlikely. |
| 2025-07 | OSWorld-Verified launches with 300+ task fixes and stricter evaluation. GTA1 (Salesforce) reaches #1 at 45.2%. Mid-2025 65% target officially missed. |
| 2025-08 | GPT-5 and Claude Opus 4.1 release (~44%). CoACT-1 breaks 60% (60.76%) via hybrid GUI+coding approach. 65% threshold now within striking distance. |
| 2025-09 | Claude Sonnet 4.5 scores 61.4% on OSWorld. UiPath Screen Agent debuts at #2 on OSWorld-Verified. 65% target nearly met, ~3 months behind schedule. |
| 2025-11 | Claude Opus 4.5 scores 66.3% — 65% target confirmed, approximately 5 months behind AI-2027’s mid-2025 timeline. AskUI VisionAgent independently confirms 65%+ is achievable (66.2% on OSWorld-Verified). Confidence rises for 80% target. |
| 2025-12 | GPT-5.2 scores 47.3% on OSWorld-Verified — surprisingly below Claude’s 66.3%. 80% target still looks ambitious. |
| 2026-01 | UiPath Screen Agent (Claude Opus 4.5) achieves #1 on OSWorld-Verified at 67.1%. Enterprise agents now match research agents. 80% target unmet as “early 2026” window opens. |
| 2026-02 | Claude Opus 4.6 scores 72.7% — matching the human baseline (~72.4%). 80% not yet reached but progress accelerating. |
| 2026-03 | GPT-5.4 scores 75.0% on OSWorld-Verified, well above the ~72.4% human baseline. 80% target likely within months. Status: Confirmed for 65%; 80% target running ~3-6 months late. Confidence 0.90. |
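The “likely within months” estimate in the last row can be sanity-checked with a naive linear extrapolation over the frontier scores logged above. Dates are approximated to month starts, and mixing OSWorld with OSWorld-Verified scores plus likely saturation near the hard tail make this a rough sketch only, not a forecast.

```python
from datetime import date

# Frontier scores from the update log (dates rounded to month starts).
points = [
    (date(2025, 11, 1), 66.3),  # Claude Opus 4.5
    (date(2026, 2, 1), 72.7),   # Claude Opus 4.6
    (date(2026, 3, 1), 75.0),   # GPT-5.4 (OSWorld-Verified)
]

days = (points[-1][0] - points[0][0]).days     # 120 days, Nov 1 to Mar 1
rate = (points[-1][1] - points[0][1]) / days   # percentage points per day
days_to_80 = (80.0 - points[-1][1]) / rate     # days from the last data point

print(f"~{rate * 30:.1f} pts/month; 80% in roughly {days_to_80 / 30:.1f} more months")
```

At the recent rate of roughly two percentage points per month, the 80% threshold lands around mid-2026, consistent with the “3-6 months late” status above.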