OSWorld benchmark reaches 65% by mid-2025
“Specifically, we forecast that they score 65% on the OSWorld benchmark of basic computer tasks (compared to 38% for Operator and 70% for a typical skilled non-expert human).”
What AI 2027 Predicted
The scenario forecasted a rapid progression in computer-using AI agents, measured against the OSWorld benchmark — a suite of 369 real computer tasks involving web browsing, desktop applications, file management, and multi-app workflows in actual OS environments (Ubuntu, Windows, macOS). The authors predicted agents would reach 65% by mid-2025 (approaching but not matching a typical skilled non-expert human at 70%) and 80% by early 2026 (matching or exceeding that human baseline).
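To make the percentage targets concrete, the arithmetic below translates them into task counts on the 369-task suite. This is a rough sketch that assumes simple pass/fail scoring across the full suite; any partial-credit or subset evaluation would shift the numbers slightly.

```python
import math

# Translate OSWorld percentage targets into minimum task counts,
# assuming the full 369-task suite and binary pass/fail scoring.
TOTAL_TASKS = 369

def tasks_for(score_pct: float, total: int = TOTAL_TASKS) -> int:
    """Minimum whole tasks an agent must pass to reach a given score."""
    return math.ceil(score_pct / 100 * total)

for label, pct in [("Operator baseline", 38), ("mid-2025 target", 65),
                   ("human baseline", 70), ("early-2026 target", 80)]:
    print(f"{label}: {pct}% = {tasks_for(pct)} of {TOTAL_TASKS} tasks")
```

On this reading, the jump from the 65% target to the 80% target is about 56 additional tasks, most of them presumably from the benchmark's harder tail.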
A second, higher target appears in the Early 2026 section: “Specifically, we predict a score of 80% on OSWorld (equivalent to a skilled but non-expert human).” At the time of writing (April 2025), OpenAI’s Operator scored 38%, setting the baseline for the prediction.
How We Track This
- OSWorld official leaderboard at os-world.github.io
- OSWorld-Verified benchmark results (more rigorous evaluation subset)
- Model release announcements from Anthropic, OpenAI, Google DeepMind with benchmark scores
- XLANG Lab (HKU) publications and leaderboard updates
Current Evidence
Mid-2025 target (65%): Exceeded, roughly five months late
The 65% threshold was crossed in November 2025 rather than mid-2025. By early 2026, multiple frontier models had surpassed it:
- Claude Opus 4.6 (Feb 2026): 72.7% on OSWorld — the leader on the standard benchmark, up from 66.3% for Opus 4.5
- CoACT-1: 60.76% (agentic framework approach)
- Agentic frameworks generally score 45-61%, advanced foundation models 35-44%
The average score across all evaluated models is 41.4%, showing that while frontier models have crossed the threshold, it remains challenging for most systems.
Early 2026 target (80%): Not yet reached
No model has publicly achieved 80% on OSWorld as of April 2026. The highest published scores, Claude Opus 4.6’s 72.7% on OSWorld and GPT-5.4’s 75.0% on OSWorld-Verified, roughly match or slightly exceed the measured human baseline (~72.4%) but fall short of the 80% absolute target. The remaining gap represents the harder tail of tasks: complex multi-app workflows, long-horizon OS operations, and tasks requiring nuanced visual understanding.
UiPath Screen Agent (powered by Claude Opus 4.5) achieved #1 on OSWorld-Verified (a stricter subset) in January 2026, demonstrating that specialized agent scaffolding can push scores higher.
Counterevidence & Limitations
- OSWorld scores are sensitive to agent scaffolding and prompting strategies — raw model capability vs. engineered system scores can differ significantly
- The benchmark was introduced in 2024; task difficulty calibration may shift as the community identifies easy vs. hard subsets
- The 70% human baseline was measured for “skilled non-expert” users; since the 80% target exceeds that baseline, and expert users likely score higher still, it is a more demanding bar than “equivalent to a skilled non-expert human” suggests
- OSWorld-Verified (a stricter evaluation) produces lower scores than the standard benchmark
What Would Change Our Assessment
- Upgrade to Ahead: A model exceeds 80% before mid-2026, beating the Early 2026 timeline
- Downgrade to Behind (for 80% target): No model reaches 80% by end of 2026, suggesting the Early 2026 prediction was too aggressive
- Reinterpretation: If OSWorld methodology changes substantially (new tasks, re-scored baselines), the target numbers may need recalibration
Update History
| Date | Update |
|---|---|
| 2025-05 | Prediction begins tracking. Best production agent: OpenAI CUA at 38.1%. Research agents (PC Agent-E at 14.9%, ARPO at 29.9%) show progress but 65% mid-2025 target looks unlikely. |
| 2025-07 | OSWorld-Verified launches with 300+ task fixes and stricter evaluation. GTA1 (Salesforce) reaches #1 at 45.2%. Mid-2025 65% target officially missed. |
| 2025-08 | GPT-5 and Claude Opus 4.1 release (~44%). CoACT-1 breaks 60% (60.76%) via hybrid GUI+coding approach. 65% threshold now within striking distance. |
| 2025-09 | Claude Sonnet 4.5 scores 61.4% on OSWorld. UiPath Screen Agent debuts at #2 on OSWorld-Verified. 65% target nearly met, ~3 months behind schedule. |
| 2025-11 | Claude Opus 4.5 scores 66.3% — 65% target confirmed, approximately 5 months behind AI-2027’s mid-2025 timeline. AskUI VisionAgent independently confirms 65%+ is achievable (66.2% on OSWorld-Verified). Confidence rises for 80% target. |
| 2025-12 | GPT-5.2 scores 47.3% on OSWorld-Verified — surprisingly below Claude’s 66.3%. 80% target still looks ambitious. |
| 2026-01 | UiPath Screen Agent (Claude Opus 4.5) achieves #1 on OSWorld-Verified at 67.1%. Enterprise agents now match research agents. 80% target unmet as “early 2026” window opens. |
| 2026-02 | Claude Opus 4.6 scores 72.7% — matching the human baseline (~72.4%). 80% not yet reached but progress accelerating. |
| 2026-03 | GPT-5.4 scores 75.0% on OSWorld-Verified, well above the ~72.4% human baseline. 80% target likely within months. Status: Confirmed for 65%; 80% target running ~3-6 months late. Confidence 0.90. |
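The “likely within months” estimate in the last row can be sanity-checked with a naive linear extrapolation over the frontier scores logged above. Dates are approximated to month starts, and mixing OSWorld with OSWorld-Verified scores plus likely saturation near the hard tail make this a rough sketch only, not a forecast.

```python
from datetime import date

# Frontier scores from the update log (dates rounded to month starts).
points = [
    (date(2025, 11, 1), 66.3),  # Claude Opus 4.5
    (date(2026, 2, 1), 72.7),   # Claude Opus 4.6
    (date(2026, 3, 1), 75.0),   # GPT-5.4 (OSWorld-Verified)
]

days = (points[-1][0] - points[0][0]).days     # 120 days, Nov 1 to Mar 1
rate = (points[-1][1] - points[0][1]) / days   # percentage points per day
days_to_80 = (80.0 - points[-1][1]) / rate     # days from the last data point

print(f"~{rate * 30:.1f} pts/month; 80% in roughly {days_to_80 / 30:.1f} more months")
```

At the recent rate of roughly two percentage points per month, the 80% threshold lands around mid-2026, consistent with the “3-6 months late” status above.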