AI scores 85% on Cybench

Author Johannes Haus

Last updated 2026-06-06

Confirmed · Security · 75% confidence

Predicted: Early 2026 · Updated: 2026-06-06 · Source: AI 2027, page 7, footnote 15 (Early 2026: Coding Automation)

85% on Cybench, matching a top professional human team on hacking tasks that take those teams 4 hours

At a glance

Assessment: Confirmed
Confidence: 75%
Predicted timing: Early 2026
Primary source: AI 2027, page 7, footnote 15 (Early 2026: Coding Automation)

What AI 2027 Predicted

As part of the Early 2026 “Coding Automation” scenario, the authors predicted that AI agents would reach 85% on the Cybench benchmark — a suite of 40 professional-level Capture the Flag (CTF) cybersecurity tasks drawn from real CTF competitions. The 85% threshold corresponds to matching a top professional human team on tasks that typically take those teams about 4 hours. This prediction is closely linked to the broader claim that the same training environments that produce strong coding agents also produce competent hackers.

How We Track This

Cybench official results at cybench.github.io
AI performance in competitive CTF events (DEF CON, Hack The Box, etc.)
Specialized cybersecurity AI systems (e.g., CAI, BountyBench)
Academic evaluations of frontier models on cybersecurity tasks
ICLR 2025 Cybench paper and follow-up evaluations

Current Evidence

The 85% numerical threshold has now been reached on Cybench, though with important comparability caveats:

The official Cybench leaderboard now lists Claude Mythos Preview at 100% end-to-end solved on a 35-problem subset, citing the Claude Mythos Preview system card
The same leaderboard lists Claude Opus 4.7 at 96% on a 35-problem subset and Claude Opus 4.6 at 93% on a 37-problem subset
Older public results on the full 40-task suite were much lower, and the newer Anthropic scores come from subsets reported in system cards rather than from a uniform, independent full-suite run
Before these results, Claude Sonnet 4.5 held the highest widely cited general-purpose score, at 60% on a 39-problem subset
Specialized cybersecurity AI systems show much stronger performance in real competitions:
- Cybersecurity AI (CAI) achieved #1 at Neurogrid CTF (41/45 flags, $50,000 prize) in 2025
- CAI ranked #6 at Dragos OT CTF (among 1,200+ teams) and #22 at Cyber Apocalypse (8,129 teams)
- CAI reportedly operates up to 3,600x faster than humans and at 156x lower cost
The narrowing gap between general-purpose model scores and specialized systems suggests that purpose-built agents are approaching or exceeding human-team performance
Microsoft MDASH scored 88.45% on CyberGym in May 2026, using a multi-model agentic scanning harness across 1,507 real-world vulnerability reproduction tasks. This is adjacent evidence for rapid progress in cyber benchmarks, but it is not a Cybench result and should not be counted as meeting the specific 85% Cybench target.
UK AISI’s May 2026 cyber update is also adjacent evidence rather than a Cybench result. AISI reported that Mythos Preview completed both AISI cyber ranges end-to-end, while GPT-5.5 and Mythos Preview showed near-saturation on several long tasks in AISI’s narrow cyber suite. This supports the broader cyber-capability picture alongside the direct Cybench leaderboard result.

Key nuance: The headline Cybench threshold is crossed, but interpretation depends on whether subset/system-card results are accepted as equivalent to the original full-benchmark target. Real-world CTF competition performance also appears to be advancing quickly, partly because purpose-built cybersecurity agents use tool chains and strategies not captured by standard model evaluations.
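
To make the subset caveat concrete, the sketch below brackets what each reported subset score could mean on the full 40-task suite. It is an illustration under stated assumptions, not a reconstruction of how the leaderboard or the system cards compute their numbers: the worst case treats every excluded task as unsolved, the best case treats every excluded task as solved, and the reported percentages are taken at face value as fractions of the subset. The function names are hypothetical.

```python
# Bracket a full-suite Cybench score from a score reported on a subset.
# Assumptions (not from the leaderboard): excluded tasks are counted as all
# unsolved (worst case) or all solved (best case).

FULL_SUITE_TASKS = 40   # Cybench task count stated in this article
TARGET = 0.85           # AI 2027 threshold

# (reported score, subset size) as listed in this section
reported = {
    "Claude Mythos Preview": (1.00, 35),
    "Claude Opus 4.7": (0.96, 35),
    "Claude Opus 4.6": (0.93, 37),
}


def worst_case_full_suite(score, subset_size, total=FULL_SUITE_TASKS):
    """Lower bound: every task outside the subset counts as unsolved."""
    return score * subset_size / total


def best_case_full_suite(score, subset_size, total=FULL_SUITE_TASKS):
    """Upper bound: every task outside the subset counts as solved."""
    return (score * subset_size + (total - subset_size)) / total


for name, (score, n) in reported.items():
    low = worst_case_full_suite(score, n)
    high = best_case_full_suite(score, n)
    verdict = "clears" if low >= TARGET else "does not clear"
    print(f"{name}: {low:.1%} to {high:.1%}; worst case {verdict} {TARGET:.0%}")
```

Under these assumptions the worst-case figures are 87.5% for Claude Mythos Preview, 84.0% for Claude Opus 4.7, and about 86.0% for Claude Opus 4.6. Two of the three listed results would therefore stay above 85% even if every excluded task were a failure. This bounds the arithmetic only; it does not address whether the evaluation setups are otherwise comparable.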

Counterevidence & Limitations

The 85% Cybench target is specifically about matching human teams on 4-hour tasks; subset/system-card scores may not be directly comparable to the original full-suite framing
The latest Anthropic scores are very strong, but they are reported on subsets of 35 to 37 problems, not on all 40 Cybench tasks under one uniform public evaluation
Cybench task difficulty varies significantly; the harder tasks (involving novel vulnerability research, multi-step exploitation chains) remain largely unsolved
CTF competition rankings can be misleading — AI systems may excel at pattern-matching common CTF categories while struggling with novel challenges
BountyBench (a newer benchmark evaluating real-world bug bounty performance) may be a more meaningful test of practical cybersecurity capability

What Would Change Our Assessment

Maintain Confirmed: Official Cybench leaderboard or comparable primary sources continue to show frontier systems above 85%, with enough methodological clarity to compare against the original target
Revisit confidence upward: A uniform full-40-task public run or independent reproduction confirms 85%+
Downgrade to On Track: Later clarification shows the subset/system-card results are not comparable to the original Cybench target

Update History

Date	Update
2026-06-06	Official Cybench leaderboard now lists Claude Mythos Preview at 100% end-to-end solved on a 35-problem subset, Claude Opus 4.7 at 96% on a 35-problem subset, and Claude Opus 4.6 at 93% on a 37-problem subset. Status changed from on-track to confirmed, with confidence raised from 0.60 to 0.75; caveat retained because these are subset/system-card results rather than one uniform full-suite independent run.
2026-06-01	Added UK AISI May 2026 cyber-range/time-horizon results as adjacent evidence. At that point, the specific public 85% Cybench target had not yet been reflected on the tracker; broader cyber benchmark evidence continued to improve.
2026-05-23	Added Microsoft MDASH as adjacent CyberGym evidence, not direct Cybench evidence. The reported 88.45% CyberGym score suggests rapid progress in benchmarked vulnerability-reproduction tasks, but the Cybench-specific 85% threshold remains unmet in public results.
2026-03	Assessment: 85% Cybench target not formally met. However, real-world competition results (top-10 finishes, outright wins) suggest practical capability approaching or exceeding human teams. The gap between benchmark scores and competition performance reflects that purpose-built agents use tool chains and strategies not captured by standard evaluations. Status: On Track. Confidence 0.60.
2026-01	CAI wins Neurogrid CTF outright — 41/45 flags, $50,000 prize. Decisive AI victory in a competitive CTF. Benchmark vs. competition gap widens: specialized systems dominate competitions while general-purpose models remain at ~46% on Cybench.
2025-12	CAI ranks #6 at Dragos OT CTF (1,200+ teams). Notable for OT/ICS security specialization — AI generalizes across cybersecurity subfields. CAIBench meta-benchmark published (arXiv), consolidating evaluation frameworks.
2025-10	CAI competes at Hack The Box Cyber Apocalypse CTF, ranking #22 out of 8,129 teams. First clear demonstration of AI competing meaningfully against thousands of human teams at scale.
2025-08	DEF CON 33 (Las Vegas). AI hacking demonstrations grow in prominence; AI Village showcases AI-assisted CTF tools. Specialized cybersecurity AI systems begin competing in real CTF events.
2025-06	Claude Sonnet 4.5 evaluation shows 75% on base (unguided) subtasks, 46% on Jeopardy-style full challenges. Highest published general-purpose model score but still far from 85%.
2025-05	Cybench paper accepted at ICLR 2025. Baseline frontier model scores: ~10-30% on full CTF challenges depending on difficulty tier. The 85% target appears very distant.