SE | Apr 17

Evaluating LLM Agents on Automated Software Analysis Tasks

Islem Bouzenia, Cristian Cadar, Michael Pradel

arXiv:2604.11270 | 98.0 | h-index: 36 | Has Code

AI Analysis

For researchers and practitioners in software engineering, this work provides the first systematic benchmark and evaluation of LLM agents for automated software analysis, revealing key limitations and design insights.

The paper introduces AnalysisBench, a benchmark for evaluating LLM agents on automated software analysis tasks, and shows that the authors' custom agent, AnalysisAgent, achieves a 94% success rate (33/35 tasks) compared with 77% for the best baseline, indicating that agent architecture matters more than LLM capability alone.

Numerous software analysis tools exist today, yet applying them to diverse open-source projects remains challenging due to environment setup, dependency resolution, and tool configuration. LLM-based agents offer a potential solution, but no prior work has systematically studied their effectiveness on the specific task of automated software analysis, which, unlike issue solving or general environment setup, requires installing and configuring a separate analysis tool alongside the target project, generating tool-specific prerequisites, and validating that the tool produces meaningful analysis outputs. We introduce AnalysisBench, a benchmark of 35 tool-project pairs spanning seven analysis tools and ten diverse C/C++ and Java projects, each with a manually constructed reference setup. Using AnalysisBench, we evaluate four agent architectures across four LLM backends. Our custom agent, AnalysisAgent, achieves a manually verified success rate of 94% (Gemini-3-Flash, 33/35 tasks), compared to 77% for the best baseline (ExecutionAgent). Beyond quantitative results, we identify key limitations in existing agents, including stage mixing, poor error localization, and premature termination, and show that agentic architecture matters more than LLM capability alone. We further find that whole-program analyses and symbolic execution are the most difficult tasks, that Java toolchains pose greater challenges than C/C++, and that LLM-self-validated success consistently overstates manually verified success.
