CR AI MAMay 20

CTFExplorer: Evaluating LLM Offensive Agents Through Multi-Target Web CTF Benchmarking

Nanda Rani, Kimberly Milner, Minghao Shao, Meet Udeshi, Haoran Xi, Venkata Sai Charan Putrevu, Saksham Aggarwal, Sandeep K. Shukla, Prashanth Krishnamurthy, Farshad Khorrami, Muhammad Shafique, Ramesh Karri

arXiv:2602.080230.372 citationsh-index: 41

AI Analysis55

For researchers developing LLM-based offensive security agents, this benchmark addresses the lack of multi-target evaluation that captures strategic decision-making under uncertainty, moving beyond single-target exploitation tests.

CTFExplorer introduces a multi-target web CTF benchmark with 40 vulnerable services to evaluate LLM-based offensive agents on strategic reasoning beyond exploitation, including target prioritization and attack chaining. The benchmark provides a reactive multi-agent framework and structured evaluation for behavioral analysis.

Existing benchmarks for LLM-based offensive security agents use isolated, single-target setups with a known vulnerable service and fixed objective. They measure exploitation effectively, but miss how real Capture-the-Flag (CTF) participants triage unknown surfaces, prioritize targets, and allocate effort under uncertainty. Current evaluations therefore fail to assess strategic reasoning beyond exploitation alone. To address this, we introduce \textit{CTFExplorer}, a benchmark suite that shifts offensive security evaluation toward a multi-target setting, which tests how agents explore, prioritize, and chain attacks. CTFExplorer deploys 40 web-based vulnerable services within a single environment, where agents must autonomously discover, distinguish, and exploit targets without predefined guidance. We also present a reactive multi-agent setup as a reference agent framework and develop an agent-agnostic evaluation framework that records structured reasoning traces for fine-grained assessment. This enables behavioral evaluation beyond binary flag capture, such as how agents manage target selection, handle failed hypotheses, coordinate across multiple stages, and extract security intelligence.

View on arXiv PDF

Similar