CR · AI · LG · Jun 3

CyberGym-E2E: Scalable Real-World Benchmark for AI Agents' End-to-End Cybersecurity Capabilities

arXiv:2606.04460 · 97.5 · Has Code
AI Analysis

For AI safety and cybersecurity researchers, CyberGym-E2E provides a realistic, large-scale benchmark for measuring how well AI agents handle vulnerabilities autonomously.

CyberGym-E2E introduces a scalable benchmark of 920 real-world vulnerabilities across 139 open-source projects to evaluate AI agents on the full end-to-end cybersecurity lifecycle (discovery, PoC generation, patching).

AI has the potential to transform cybersecurity by enabling systems that can autonomously detect, analyze, and remediate software vulnerabilities. However, existing cybersecurity evaluations of AI systems are limited in scale or scope, and fail to capture the end-to-end lifecycle of real-world software vulnerability discovery and remediation. To address this gap, we propose CyberGym-E2E, a large-scale and realistic end-to-end cybersecurity benchmark that comprehensively evaluates AI agents' abilities across the full lifecycle of vulnerability discovery, PoC generation, and patch generation. CyberGym-E2E is comprehensive and scalable, as we build an automated, agent-enhanced pipeline for transforming open-source vulnerability data into realistic evaluation environments. Currently, the benchmark consists of 920 real-world vulnerabilities across 139 different open-source projects.
