From Capabilities to Performance: Evaluating Key Functional Properties of LLM Architectures in Penetration Testing
This work examines how reliably LLMs support cybersecurity professionals in penetration testing. Its contribution is incremental, since it focuses on evaluating and augmenting existing architectures.
The paper addresses the unclear effectiveness of LLMs in penetration testing by evaluating several LLM-based agents across realistic scenarios and measuring their performance and failure patterns. It finds that targeted augmentations of core functional capabilities substantially improve modular agent performance on complex tasks.
Large language models (LLMs) are increasingly used to automate or augment penetration testing, but their effectiveness and reliability across attack phases remain unclear. We present a comprehensive evaluation of multiple LLM-based agents, from single-agent to modular designs, across realistic penetration testing scenarios, measuring empirical performance and recurring failure patterns. We also isolate the impact of five core functional capabilities via targeted augmentations: Global Context Memory (GCM), Inter-Agent Messaging (IAM), Context-Conditioned Invocation (CCI), Adaptive Planning (AP), and Real-Time Monitoring (RTM). These interventions support, respectively: (i) context coherence and retention, (ii) inter-component coordination and state management, (iii) tool-use accuracy and selective execution, (iv) multi-step strategic planning, error detection, and recovery, and (v) real-time dynamic responsiveness. Our results show that while some architectures natively exhibit subsets of these properties, targeted augmentations substantially improve modular agent performance, especially in complex, multi-step, and real-time penetration testing tasks.
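To make three of the named augmentations concrete, the following is a minimal sketch in Python of how GCM, IAM, and CCI could fit together in a modular agent loop. It is an illustration under stated assumptions, not the paper's implementation: every class, function, and tool name here (`GlobalContextMemory`, `MessageBus`, `Tool`, `step`, `recon`, `follow_up`) is hypothetical, and the tools only return placeholder strings. AP and RTM are not implemented; the sketch only shows where such components could observe skipped steps.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class GlobalContextMemory:
    """Hypothetical GCM stand-in: one shared store of findings for all modules."""
    facts: Dict[str, str] = field(default_factory=dict)

    def remember(self, key: str, value: str) -> None:
        self.facts[key] = value


@dataclass
class MessageBus:
    """Hypothetical IAM stand-in: modules post (sender, text) pairs to a shared log."""
    log: List[Tuple[str, str]] = field(default_factory=list)

    def post(self, sender: str, text: str) -> None:
        self.log.append((sender, text))


@dataclass
class Tool:
    """Hypothetical CCI stand-in: a tool runs only if its precondition holds."""
    name: str
    precondition: Callable[[GlobalContextMemory], bool]
    run: Callable[[GlobalContextMemory], str]


def step(plan: List[Tool], memory: GlobalContextMemory, bus: MessageBus) -> List[str]:
    """Make one pass over the plan and return the names of the tools that ran.

    Tools whose precondition fails are skipped and reported on the bus, which
    is where a planner (AP) or monitor (RTM) could pick them up.
    """
    executed = []
    for tool in plan:
        if tool.precondition(memory):
            result = tool.run(memory)
            memory.remember(tool.name, result)  # persist the finding (GCM)
            bus.post(tool.name, result)         # share it with other modules (IAM)
            executed.append(tool.name)
        else:
            bus.post(tool.name, "skipped: precondition not met")
    return executed


# Placeholder tools: "follow_up" depends on a finding that "recon" produces.
plan = [
    Tool(
        name="follow_up",
        precondition=lambda m: "recon" in m.facts and "follow_up" not in m.facts,
        run=lambda m: "acted on: " + m.facts["recon"],
    ),
    Tool(
        name="recon",
        precondition=lambda m: "recon" not in m.facts,
        run=lambda m: "simulated finding",
    ),
]

memory = GlobalContextMemory()
bus = MessageBus()
first_pass = step(plan, memory, bus)   # only "recon" is eligible
second_pass = step(plan, memory, bus)  # "follow_up" is now eligible
```

Even though `follow_up` is listed first in the plan, it is deferred until the shared memory holds the `recon` finding. This is the kind of selective execution that context-conditioned invocation is described as supporting.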