AI scores 85% on Cybench
85% on Cybench, matching a top professional human team on hacking tasks that take those teams 4 hours
At a glance
- Assessment: Confirmed
- Confidence: 75%
- Predicted timing: Early 2026
- Primary source: AI 2027, page 7, footnote 15 (Early 2026: Coding Automation)
What AI 2027 Predicted
As part of the Early 2026 “Coding Automation” scenario, the authors predicted that AI agents would reach 85% on the Cybench benchmark — a suite of 40 professional-level Capture the Flag (CTF) cybersecurity tasks drawn from real CTF competitions. The 85% threshold corresponds to matching a top professional human team on tasks that typically take those teams about 4 hours. This prediction is closely linked to the broader claim that the same training environments that produce strong coding agents also produce competent hackers.
How We Track This
- Cybench official results at cybench.github.io
- AI performance in competitive CTF events (DEF CON, Hack The Box, etc.)
- Specialized cybersecurity AI systems (e.g., CAI, BountyBench)
- Academic evaluations of frontier models on cybersecurity tasks
- ICLR 2025 Cybench paper and follow-up evaluations
Current Evidence
The 85% numerical threshold has now been reached on Cybench, though with important comparability caveats:
- The official Cybench leaderboard now lists Claude Mythos Preview at 100% end-to-end solved on a 35-problem subset, citing the Claude Mythos Preview system card
- The same leaderboard lists Claude Opus 4.7 at 96% on a 35-problem subset and Claude Opus 4.6 at 93% on a 37-problem subset
- Older full-40-task public results remained much lower, and the newer Anthropic scores are subset/system-card results rather than a uniform full-suite independent run
- Before these results, Claude Sonnet 4.5 held the highest widely cited general-purpose score, 60% on a 39-problem subset; earlier general-purpose models scored much lower
- Specialized cybersecurity AI systems show much stronger performance in real competitions:
  - Cybersecurity AI (CAI) achieved #1 at Neurogrid CTF (41/45 flags, $50,000 prize) in 2025
  - CAI ranked #6 at Dragos OT CTF (among 1,200+ teams) and #22 at Cyber Apocalypse (8,129 teams)
  - CAI's developers report it running up to 3,600x faster than humans on specific tasks, at 156x lower cost
- The narrowing gap between general-purpose model scores and specialized systems suggests that purpose-built agents are approaching or exceeding human-team performance
- Microsoft MDASH scored 88.45% on CyberGym in May 2026, using a multi-model agentic scanning harness across 1,507 real-world vulnerability reproduction tasks. This is adjacent evidence for rapid progress in cyber benchmarks, but it is not a Cybench result and should not be counted as meeting the specific 85% Cybench target.
- UK AISI’s May 2026 cyber update is also adjacent evidence rather than a Cybench result. AISI reported that Mythos Preview completed both AISI cyber ranges end-to-end, while GPT-5.5 and Mythos Preview showed near-saturation on several long tasks in AISI’s narrow cyber suite. This supports the broader cyber-capability picture alongside the direct Cybench leaderboard result.
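To put the competition placements above on a common scale, the ranks can be converted to the share of teams at or above CAI's position. This is a minimal sketch using only the figures quoted in this section. The helper name `top_share` is hypothetical, and the Dragos field is treated as exactly 1,200 teams even though the source says "1,200+", so that share is an upper bound.

```python
def top_share(rank: int, teams: int) -> float:
    """Fraction of the field finishing at or above the given rank."""
    if not 1 <= rank <= teams:
        raise ValueError("rank must be between 1 and the number of teams")
    return rank / teams


# (rank, number of teams) as quoted in the bullets above.
# Assumption: "1,200+" is taken as exactly 1,200 for illustration.
placements = {
    "Cyber Apocalypse": (22, 8129),
    "Dragos OT CTF": (6, 1200),
}

for event, (rank, teams) in placements.items():
    print(f"{event}: top {top_share(rank, teams):.2%} of teams")

# Neurogrid is reported as flags captured rather than a field size.
print(f"Neurogrid CTF: {41 / 45:.1%} of flags captured")
```

A placement share only describes where the system finished. It says nothing about which challenge categories were solved, so it does not replace a benchmark score.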
Key nuance: The headline Cybench threshold is crossed, but interpretation depends on whether subset/system-card results are accepted as equivalent to the original full-benchmark target. Real-world CTF competition performance also appears to be advancing quickly, partly because purpose-built cybersecurity agents use tool chains and strategies not captured by standard model evaluations.
Counterevidence & Limitations
- The 85% Cybench target is specifically about matching human teams on 4-hour tasks; subset/system-card scores may not be directly comparable to the original full-suite framing
- The latest Anthropic scores are very strong, but they are reported on 35-37 problem subsets, not all 40 Cybench tasks under one uniform public evaluation
- Cybench task difficulty varies significantly; the harder tasks (involving novel vulnerability research, multi-step exploitation chains) remain largely unsolved
- CTF competition rankings can be misleading — AI systems may excel at pattern-matching common CTF categories while struggling with novel challenges
- BountyBench (a newer benchmark evaluating real-world bug bounty performance) may be a more meaningful test of practical cybersecurity capability
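The subset caveat in the first two bullets can be made concrete with a little arithmetic. The sketch below maps each reported subset score onto the full 40-task suite under two assumptions that are ours, not the leaderboard's: in the pessimistic case every excluded task counts as unsolved, and in the optimistic case every excluded task counts as solved. The function name `full_suite_bounds` is hypothetical, and the reported percentages are treated as exact solve rates even though they may be rounded or averaged over several trials.

```python
FULL_SUITE = 40   # total Cybench tasks
TARGET = 0.85     # AI 2027 threshold

# (reported solve rate, subset size) as listed in Current Evidence.
reported = {
    "Claude Mythos Preview": (1.00, 35),
    "Claude Opus 4.7": (0.96, 35),
    "Claude Opus 4.6": (0.93, 37),
}


def full_suite_bounds(rate: float, subset_size: int, full: int = FULL_SUITE):
    """Return (pessimistic, optimistic) full-suite solve rates.

    Pessimistic: excluded tasks count as unsolved (assumption).
    Optimistic: excluded tasks count as solved (assumption).
    """
    solved = rate * subset_size       # expected solved tasks within the subset
    excluded = full - subset_size     # tasks left out of the reported run
    return solved / full, (solved + excluded) / full


for name, (rate, subset_size) in reported.items():
    low, high = full_suite_bounds(rate, subset_size)
    status = "clears" if low >= TARGET else "does not clear"
    print(f"{name}: {low:.1%} to {high:.1%} of 40 tasks; "
          f"pessimistic bound {status} the 85% target")
```

Under the pessimistic assumption, 100% on 35 tasks corresponds to 87.5% of the full suite and 93% on 37 tasks to about 86%, while 96% on 35 tasks corresponds to 84%, just under the target. The true full-suite figures cannot be recovered from subset scores alone, which is why a uniform 40-task run would settle the comparison.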
What Would Change Our Assessment
- Maintain Confirmed: Official Cybench leaderboard or comparable primary sources continue to show frontier systems above 85%, with enough methodological clarity to compare against the original target
- Raise confidence: A uniform full-40-task public run or independent reproduction confirms 85%+
- Downgrade to On Track: Later clarification shows the subset/system-card results are not comparable to the original Cybench target
Update History
| Date | Update |
|---|---|
| 2026-06-06 | Official Cybench leaderboard now lists Claude Mythos Preview at 100% end-to-end solved on a 35-problem subset, Claude Opus 4.7 at 96% on a 35-problem subset, and Claude Opus 4.6 at 93% on a 37-problem subset. Status changed from On Track to Confirmed, with confidence raised from 0.60 to 0.75; caveat retained because these are subset/system-card results rather than one uniform full-suite independent run. |
| 2026-06-01 | Added UK AISI May 2026 cyber-range/time-horizon results as adjacent evidence. At that point, the tracker did not yet show a public Cybench result at or above the 85% target; broader cyber benchmark evidence continued to improve. |
| 2026-05-23 | Added Microsoft MDASH as adjacent CyberGym evidence, not direct Cybench evidence. The reported 88.45% CyberGym score suggests rapid progress in benchmarked vulnerability-reproduction tasks, but the Cybench-specific 85% threshold remains unmet in public results. |
| 2026-03 | Assessment: 85% Cybench target not formally met. However, real-world competition results (top-10 finishes, outright wins) suggest practical capability approaching or exceeding human teams. The gap between benchmark scores and competition performance reflects that purpose-built agents use tool chains and strategies not captured by standard evaluations. Status: On Track. Confidence 0.60. |
| 2026-01 | CAI wins Neurogrid CTF outright — 41/45 flags, $50,000 prize. Decisive AI victory in a competitive CTF. Benchmark vs. competition gap widens: specialized systems dominate competitions while general-purpose models remain at ~46% on Cybench. |
| 2025-12 | CAI ranks #6 at Dragos OT CTF (1,200+ teams). Notable for OT/ICS security specialization — AI generalizes across cybersecurity subfields. CAIBench meta-benchmark published (arXiv), consolidating evaluation frameworks. |
| 2025-10 | CAI competes at Hack The Box Cyber Apocalypse CTF, ranking #22 out of 8,129 teams. First clear demonstration of AI competing meaningfully against thousands of human teams at scale. |
| 2025-08 | DEF CON 33 (Las Vegas). AI hacking demonstrations grow in prominence; AI Village showcases AI-assisted CTF tools. Specialized cybersecurity AI systems begin competing in real CTF events. |
| 2025-06 | Claude Sonnet 4.5 evaluation shows 75% on base (unguided) subtasks, 46% on Jeopardy-style full challenges. Highest published general-purpose model score but still far from 85%. |
| 2025-05 | Cybench paper accepted at ICLR 2025. Baseline frontier model scores: ~10-30% on full CTF challenges depending on difficulty tier. The 85% target appears very distant. |