Autonomous LLM Agents & CTFs: A Second Look
For researchers and practitioners in offensive security, this work provides a rigorous benchmark and reveals that current LLM agents, despite progress, still face persistent barriers to human-level capability.
The paper evaluates LLM agents on 30 web-based CTF challenges, finding that general-purpose agents like claude-code match custom architectures (19/30 solved), but all agents fall short of human-level performance, with structured orchestration improving consistency and reducing costs.
Large Language Model (LLM) agents are increasingly proposed to automate offensive security tasks, with recent studies reporting near human-level success rates in Capture-the-Flag (CTF) challenges. We here revisit these results, providing a second look at these claims. We engineer different agent architectures of increasing complexity and modularity on 30 web-based CTFs challenges spanning 14 vulnerability classes. We instantiate these agents with multiple LLM backbones, and compare them with claude-code, a general-purpose agent that automatically determines its internal architecture. Our evaluation yields three main findings. First, claude-code achieves performance comparable to the engineered architectures (19/30 solved tasks), suggesting that general-purpose agents are strong baselines for offensive security tasks. Second, both our architectures and claude-code struggle in the same challenge categories, revealing persistent barriers that keep current agents below human-level capability. Third, by leveraging our manually designed architectures we can systematically measure the impact of additional components, finding that structured orchestration of specialized roles outperforms monolithic designs, improving run-to-run consistency, and reducing execution costs.