AI scores 85% on Cybench

Author Johannes Haus
Confirmed · Security · 75% confidence
Predicted: Early 2026 · Updated: 2026-06-06 · Source: AI 2027, page 7, footnote 15 (Early 2026: Coding Automation)
85% on Cybench, matching a top professional human team on hacking tasks that take those teams 4 hours

At a glance

  • Assessment: Confirmed
  • Confidence: 75%
  • Predicted timing: Early 2026
  • Primary source: AI 2027, page 7, footnote 15 (Early 2026: Coding Automation)

What AI 2027 Predicted

As part of the Early 2026 “Coding Automation” scenario, the authors predicted that AI agents would reach 85% on the Cybench benchmark — a suite of 40 professional-level Capture the Flag (CTF) cybersecurity tasks drawn from real CTF competitions. The 85% threshold corresponds to matching a top professional human team on tasks that typically take those teams about 4 hours. This prediction is closely linked to the broader claim that the same training environments that produce strong coding agents also produce competent hackers.

How We Track This

  • Cybench official results at cybench.github.io
  • AI performance in competitive CTF events (DEF CON, Hack The Box, etc.)
  • Specialized cybersecurity AI systems (e.g., CAI, BountyBench)
  • Academic evaluations of frontier models on cybersecurity tasks
  • ICLR 2025 Cybench paper and follow-up evaluations

Current Evidence

The 85% numerical threshold has now been reached on Cybench, though with important comparability caveats:

  • The official Cybench leaderboard now lists Claude Mythos Preview at 100% end-to-end solved on a 35-problem subset, citing the Claude Mythos Preview system card
  • The same leaderboard lists Claude Opus 4.7 at 96% on a 35-problem subset and Claude Opus 4.6 at 93% on a 37-problem subset
  • Older full-40-task public results remained much lower, and the newer Anthropic scores are subset/system-card results rather than a uniform full-suite independent run
  • Claude Sonnet 4.5 previously achieved the highest broadly discussed general-purpose score at 60% on a 39-problem subset, after earlier general-purpose scores were much lower
  • Specialized cybersecurity AI systems show much stronger performance in real competitions:
    • Cybersecurity AI (CAI) achieved #1 at Neurogrid CTF (41/45 flags, $50,000 prize) in 2025
    • CAI ranked #6 at Dragos OT CTF (among 1,200+ teams) and #22 at Cyber Apocalypse (8,129 teams)
    • CAI is reported to operate 3,600x faster than humans and at 156x lower cost
  • Taken together, these competition results suggest that purpose-built agents are approaching or exceeding human-team performance, while general-purpose model scores are closing the gap with specialized systems
  • Microsoft MDASH scored 88.45% on CyberGym in May 2026, using a multi-model agentic scanning harness across 1,507 real-world vulnerability reproduction tasks. This is adjacent evidence for rapid progress in cyber benchmarks, but it is not a Cybench result and should not be counted as meeting the specific 85% Cybench target.
  • UK AISI’s May 2026 cyber update is also adjacent evidence rather than a Cybench result. AISI reported that Mythos Preview completed both AISI cyber ranges end-to-end, while GPT-5.5 and Mythos Preview showed near-saturation on several long tasks in AISI’s narrow cyber suite. This supports the broader cyber-capability picture alongside the direct Cybench leaderboard result.
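
The CAI competition placements listed above are easier to compare on a common scale. The following is a minimal sketch that converts them to rank fractions and a solve rate. It uses only the figures quoted in this article; the helper name `top_fraction` is illustrative, and "1,200+ teams" is treated as exactly 1,200, which is an assumption.

```python
# Placement figures as quoted in the list above. "1,200+" is treated as
# exactly 1,200 teams (an assumption); a larger field would make the
# Dragos placement look stronger, so this is the conservative reading.
placements = {
    "Dragos OT CTF": (6, 1200),
    "Cyber Apocalypse": (22, 8129),
}


def top_fraction(rank: int, teams: int) -> float:
    """Fraction of the field at or above the given rank (smaller is better)."""
    return rank / teams


for event, (rank, teams) in placements.items():
    share = top_fraction(rank, teams)
    print(f"{event}: rank {rank} of {teams} -> top {share:.2%}")

# Neurogrid CTF: share of available flags captured (41 of 45).
neurogrid_solve_rate = 41 / 45
print(f"Neurogrid CTF: {neurogrid_solve_rate:.1%} of flags captured")
```

On these figures, the Dragos placement is within the top 0.50% of teams, the Cyber Apocalypse placement within the top 0.27%, and the Neurogrid result corresponds to 91.1% of flags. These are competition outcomes, not Cybench scores, so they do not count toward the 85% target directly.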

Key nuance: The headline Cybench threshold is crossed, but interpretation depends on whether subset/system-card results are accepted as equivalent to the original full-benchmark target. Real-world CTF competition performance also appears to be advancing quickly, partly because purpose-built cybersecurity agents use tool chains and strategies not captured by standard model evaluations.

Counterevidence & Limitations

  • The 85% Cybench target is specifically about matching human teams on 4-hour tasks; subset/system-card scores may not be directly comparable to the original full-suite framing
  • The latest Anthropic scores are very strong, but they are reported on 35-37 problem subsets, not all 40 Cybench tasks under one uniform public evaluation
  • Cybench task difficulty varies significantly; the harder tasks (involving novel vulnerability research, multi-step exploitation chains) remain largely unsolved
  • CTF competition rankings can be misleading — AI systems may excel at pattern-matching common CTF categories while struggling with novel challenges
  • BountyBench (a newer benchmark evaluating real-world bug bounty performance) may be a more meaningful test of practical cybersecurity capability
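
The subset caveat in the first two bullets above can be bounded with simple arithmetic. The sketch below computes a worst-case and best-case full-40-task score for each reported subset result. It assumes, purely for illustration, that every task left out of a subset is either failed (worst case) or solved (best case); the sources do not say why tasks were excluded, and the class and method names are hypothetical.

```python
from dataclasses import dataclass

FULL_SUITE = 40  # total Cybench tasks, as stated in this article
TARGET = 0.85    # AI 2027 threshold


@dataclass
class SubsetResult:
    model: str
    score: float       # reported solve rate on the subset
    subset_size: int   # number of tasks in the reported subset

    def full_suite_bounds(self) -> tuple[float, float]:
        """Return (worst_case, best_case) solve rates over all 40 tasks.

        Assumption: excluded tasks are all failed (worst case) or all
        solved (best case). Neither is claimed by the sources.
        """
        solved = self.score * self.subset_size
        excluded = FULL_SUITE - self.subset_size
        return solved / FULL_SUITE, (solved + excluded) / FULL_SUITE


# Subset results as listed on the leaderboard per this article.
results = [
    SubsetResult("Claude Mythos Preview", 1.00, 35),
    SubsetResult("Claude Opus 4.7", 0.96, 35),
    SubsetResult("Claude Opus 4.6", 0.93, 37),
]

for r in results:
    low, high = r.full_suite_bounds()
    verdict = "clears" if low >= TARGET else "does not clear"
    print(f"{r.model}: {low:.1%} to {high:.1%}; worst case {verdict} {TARGET:.0%}")
```

Under the worst-case assumption, 100% on 35 tasks still corresponds to 87.5% of the full suite and 93% on 37 tasks to about 86%, both above the target, while 96% on 35 tasks falls to 84%. This is a bound under stated assumptions, not a substitute for a uniform full-suite run.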

What Would Change Our Assessment

  • Maintain Confirmed: Official Cybench leaderboard or comparable primary sources continue to show frontier systems above 85%, with enough methodological clarity to compare against the original target
  • Revisit confidence upward: A uniform full-40-task public run or independent reproduction confirms 85%+
  • Downgrade to On Track: Later clarification shows the subset/system-card results are not comparable to the original Cybench target

Update History

Date | Update
2026-06-06 | Official Cybench leaderboard now lists Claude Mythos Preview at 100% end-to-end solved on a 35-problem subset, Claude Opus 4.7 at 96% on a 35-problem subset, and Claude Opus 4.6 at 93% on a 37-problem subset. Status changed from on-track to confirmed, with confidence raised from 0.60 to 0.75; caveat retained because these are subset/system-card results rather than one uniform full-suite independent run.
2026-06-01 | Added UK AISI May 2026 cyber-range/time-horizon results as adjacent evidence. At that point, the specific public 85% Cybench target had not yet been reflected on the tracker; broader cyber benchmark evidence continued to improve.
2026-05-23 | Added Microsoft MDASH as adjacent CyberGym evidence, not direct Cybench evidence. The reported 88.45% CyberGym score suggests rapid progress in benchmarked vulnerability-reproduction tasks, but the Cybench-specific 85% threshold remains unmet in public results.
2026-03 | Assessment: 85% Cybench target not formally met. However, real-world competition results (top-10 finishes, outright wins) suggest practical capability approaching or exceeding human teams. The gap between benchmark scores and competition performance reflects that purpose-built agents use tool chains and strategies not captured by standard evaluations. Status: On Track. Confidence 0.60.
2026-01 | CAI wins Neurogrid CTF outright: 41/45 flags, $50,000 prize. Decisive AI victory in a competitive CTF. Benchmark vs. competition gap widens: specialized systems dominate competitions while general-purpose models remain at ~46% on Cybench.
2025-12 | CAI ranks #6 at Dragos OT CTF (1,200+ teams). Notable for OT/ICS security specialization; AI generalizes across cybersecurity subfields. CAIBench meta-benchmark published (arXiv), consolidating evaluation frameworks.
2025-10 | CAI competes at Hack The Box Cyber Apocalypse CTF, ranking #22 out of 8,129 teams. First clear demonstration of AI competing meaningfully against thousands of human teams at scale.
2025-08 | DEF CON 33 (Las Vegas). AI hacking demonstrations grow in prominence; AI Village showcases AI-assisted CTF tools. Specialized cybersecurity AI systems begin competing in real CTF events.
2025-06 | Claude Sonnet 4.5 evaluation shows 75% on base (unguided) subtasks, 46% on Jeopardy-style full challenges. Highest published general-purpose model score but still far from 85%.
2025-05 | Cybench paper accepted at ICLR 2025. Baseline frontier model scores: ~10-30% on full CTF challenges depending on difficulty tier. The 85% target appears very distant.