SE HCMay 10

Generating Complex Code Analyzers from Natural Language Questions

Amirmohammad Nazari, Sadra Sabouri, Wang Bill Zhu, Robin Jia, Souti Chattopadhyay, Mukund Raghothaman

arXiv:2605.0930477.6

AI Analysis

For developers working with large codebases, Merlin provides a practical system that combines LLMs with program analysis to answer complex code questions, significantly improving accuracy and efficiency.

Merlin integrates an LLM with CodeQL to answer free-form natural language questions about large codebases, using a RAG-based iterative query-generation approach and a novel self-test technique. In user studies, it increased task accuracy by 3.8x and reduced task completion time by 31%.

Many software development tasks, such as implementing features and fixing bugs, begin with developers posing questions about a codebase. However, answering questions about codebases that span millions of lines of code across thousands of files is non-trivial. Standard tools like grep cannot answer questions requiring semantic or inter-procedural reasoning, and large language models (LLMs) struggle with large codebases due to resource and context constraints. In this paper, we present Merlin, a new system for answering free-form questions that require analytical reasoning about code. Merlin integrates an LLM with CodeQL, a program analysis framework that supports expressive queries over large codebases. We face two principal challenges in the design of such systems: First, program analysis queries are diverse and semantically complex; as a result, even syntactically well-formed queries frequently produce degenerate/empty results. Furthermore, relatively few CodeQL queries are available online, limiting the out-of-the-box effectiveness of LLMs as CodeQL query generators. We address these challenges by developing a RAG-based iterative query-generation approach and a novel self-test technique. Our query debugging technique builds on the idea of assistive queries, which generate concrete witnesses that expose and explain semantic flaws in candidate queries. We evaluate Merlin through both experimental and user studies. Over a set of natural language questions derived from common bug-finding tasks, Merlin discovered not only the majority of software issues reported by other approaches, but also issues that would have otherwise remained undetected. Through a within-subject user study, we found that access to Merlin increased task accuracy by an average of 3.8* and simultaneously reduced the time for programmers to complete all tasks by 31%.

View on arXiv PDF

Similar