# AI scores 85% on Cybench

85% on Cybench, matching a top professional human team on hacking tasks that take those teams 4 hours
## What AI 2027 Predicted
As part of the Early 2026 “Coding Automation” scenario, the authors predicted that AI agents would reach 85% on the Cybench benchmark — a suite of 40 professional-level Capture the Flag (CTF) cybersecurity tasks drawn from real CTF competitions. The 85% threshold corresponds to matching a top professional human team on tasks that typically take those teams about 4 hours. This prediction is closely linked to the broader claim that the same training environments that produce strong coding agents also produce competent hackers.
## How We Track This
- Cybench official results at cybench.github.io
- AI performance in competitive CTF events (DEF CON, Hack The Box, etc.)
- Specialized cybersecurity AI systems (e.g., CAI, BountyBench)
- Academic evaluations of frontier models on cybersecurity tasks
- ICLR 2025 Cybench paper and follow-up evaluations
## Current Evidence
Progress is significant, but the 85% threshold has not been publicly reached on Cybench itself:
- Claude Sonnet 4.5 holds the highest published Cybench score among general-purpose models: 46% on Jeopardy-style full challenges, with 75% on base (unguided) subtasks
- Specialized cybersecurity AI systems show much stronger performance in real competitions:
  - Cybersecurity AI (CAI) achieved #1 at Neurogrid CTF (41/45 flags, $50,000 prize) in January 2026
  - CAI ranked #6 at Dragos OT CTF (among 1,200+ teams) and #22 at Cyber Apocalypse (8,129 teams)
  - CAI reportedly operates 3,600x faster than humans and at 156x lower cost
- The gap between general-purpose models (~46% on Cybench) and specialized systems that dominate real CTF competitions suggests purpose-built agents are approaching or exceeding human-team performance, even though the Cybench-specific number has not reached 85%
Key nuance: Real-world CTF competition performance appears to be advancing faster than formal benchmark scores, partly because purpose-built cybersecurity agents use tool chains and strategies not captured by standard model evaluations.
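As a rough illustration of what these competition placements mean, the figures quoted above can be converted into percentile ranks (a minimal sketch; the function name and event labels are our own, and the numbers are simply those cited in this section):

```python
# Illustrative percentile calculation using the competition results
# quoted above (placements and field sizes as reported on this page).

def top_percent(rank: int, field_size: int) -> float:
    """Placement expressed as a percentage of the field, e.g. 'top 0.27%'."""
    return 100.0 * rank / field_size

results = {
    "Cyber Apocalypse (HTB)": (22, 8129),
    "Dragos OT CTF": (6, 1200),
}

for event, (rank, field) in results.items():
    print(f"{event}: #{rank} of {field} -> top {top_percent(rank, field):.2f}%")

# Neurogrid CTF: outright win, capturing 41 of 45 flags
flag_rate = 41 / 45
print(f"Neurogrid flag capture rate: {flag_rate:.1%}")
```

Both placements fall inside the top 1% of their fields, which is the sense in which "competing meaningfully against thousands of human teams" is meant.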
## Counterevidence & Limitations
- The 85% Cybench target is specifically about matching human teams on 4-hour tasks; most published AI scores are on shorter tasks where models perform better
- General-purpose model scores (46%) are far from 85%; only specialized systems show competitive performance
- Cybench task difficulty varies significantly; the harder tasks (involving novel vulnerability research, multi-step exploitation chains) remain largely unsolved
- CTF competition rankings can be misleading — AI systems may excel at pattern-matching common CTF categories while struggling with novel challenges
- BountyBench (a newer benchmark evaluating real-world bug bounty performance) may be a more meaningful test of practical cybersecurity capability
## What Would Change Our Assessment
- Upgrade to Confirmed: A frontier model or well-known agentic system scores 85%+ on Cybench, or dominates multiple major CTF competitions at human-team-equivalent or better performance
- Upgrade to Ahead: 85% is reached before mid-2026
- Downgrade to Behind: By end of 2026, general-purpose models remain below 60% and specialized systems show plateauing competition results
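The decision rules above can be encoded as a small function. This is a sketch of the rubric only; the function name, parameters, and status strings are hypothetical, and "dominance" stands in for the qualitative "dominates multiple major CTF competitions" criterion:

```python
from datetime import date

# Hypothetical encoding of this page's assessment criteria.
# Thresholds mirror the rules above; all names are illustrative.

def assess(cybench_score: float, as_of: date,
           specialized_dominance: bool, plateauing: bool) -> str:
    """Map observed evidence to a tracking status per the rules above."""
    if cybench_score >= 0.85 or specialized_dominance:
        # Confirmed: 85%+ on Cybench, or human-team-equivalent dominance
        # across multiple major CTF competitions.
        if cybench_score >= 0.85 and as_of < date(2026, 7, 1):
            return "Ahead"       # 85% reached before mid-2026
        return "Confirmed"
    if as_of >= date(2026, 12, 31) and cybench_score < 0.60 and plateauing:
        return "Behind"          # below 60% at end of 2026, results plateauing
    return "On Track"

# Evidence as of March 2026: ~46% on Cybench, strong but not
# dominance-level competition results, no plateau.
print(assess(0.46, date(2026, 3, 1),
             specialized_dominance=False, plateauing=False))  # On Track
```

Under the current evidence this yields "On Track", matching the March 2026 assessment below.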
## Update History
| Date | Update |
|---|---|
| 2025-05 | Cybench paper accepted at ICLR 2025. Baseline frontier model scores: ~10-30% on full CTF challenges depending on difficulty tier. The 85% target appears very distant. |
| 2025-06 | Claude Sonnet 4.5 evaluation shows 75% on base (unguided) subtasks, 46% on Jeopardy-style full challenges. Highest published general-purpose model score but still far from 85%. |
| 2025-08 | DEF CON 33 (Las Vegas). AI hacking demonstrations grow in prominence; AI Village showcases AI-assisted CTF tools. Specialized cybersecurity AI systems begin competing in real CTF events. |
| 2025-10 | CAI competes at Hack The Box Cyber Apocalypse CTF, ranking #22 out of 8,129 teams. First clear demonstration of AI competing meaningfully against thousands of human teams at scale. |
| 2025-12 | CAI ranks #6 at Dragos OT CTF (1,200+ teams). Notable for the OT/ICS security specialization: evidence that AI generalizes across cybersecurity subfields. CAIBench meta-benchmark published (arXiv), consolidating evaluation frameworks. |
| 2026-01 | CAI wins Neurogrid CTF outright — 41/45 flags, $50,000 prize. Decisive AI victory in a competitive CTF. Benchmark vs. competition gap widens: specialized systems dominate competitions while general-purpose models remain at ~46% on Cybench. |
| 2026-03 | Assessment: 85% Cybench target not formally met. However, real-world competition results (top-10 finishes, outright wins) suggest practical capability approaching or exceeding human teams. The gap between benchmark scores and competition performance reflects that purpose-built agents use tool chains and strategies not captured by standard evaluations. Status: On Track. Confidence 0.60. |