CR · AI · Oct 22, 2024

Towards Automated Penetration Testing: Introducing LLM Benchmark, Analysis, and Improvements

arXiv:2410.17141v4 · 32 citations · h-index: 3
Originality: Incremental advance
AI Analysis

The paper addresses organizations' exposure to cybersecurity vulnerabilities by providing a benchmark to drive progress in AI-assisted penetration testing. The advance is incremental, since it builds on existing tools such as PentestGPT.

The paper tackles the lack of comprehensive benchmarks for evaluating large language models (LLMs) in automated penetration testing by introducing a novel open benchmark. It finds that current models such as GPT-4o and Llama 3.1-405B fall short of performing end-to-end penetration testing, even with minimal human assistance.

Hacking poses a significant threat to cybersecurity, inflicting billions of dollars in damages annually. To mitigate these risks, ethical hacking, or penetration testing, is employed to identify vulnerabilities in systems and networks. Recent advancements in large language models (LLMs) have shown potential across various domains, including cybersecurity. However, there is currently no comprehensive, open, automated, end-to-end penetration testing benchmark to drive progress and evaluate the capabilities of these models in security contexts. This paper introduces a novel open benchmark for LLM-based automated penetration testing, addressing this critical gap. We first evaluate the performance of LLMs, including GPT-4o and Llama 3.1-405B, using the state-of-the-art PentestGPT tool. Our findings reveal that while Llama 3.1 demonstrates an edge over GPT-4o, both models currently fall short of performing end-to-end penetration testing, even with some minimal human assistance. Next, we advance the state of the art and present ablation studies that provide insights into improving the PentestGPT tool. Our research illuminates the challenges LLMs face in each aspect of pentesting, e.g., enumeration, exploitation, and privilege escalation. This work contributes to the growing body of knowledge on AI-assisted cybersecurity and lays the foundation for future research in automated penetration testing using large language models.

Code Implementations: 1 repo
