Interactive Evaluation of Large Language Models for Multi-Requirement Software Engineering Tasks
This work addresses the need for better evaluation methods for LLMs in software engineering, though the contribution is partly incremental because it builds on an existing benchmark (DevAI).
The authors address the problem of evaluating Large Language Models (LLMs) on complex software engineering tasks by proposing an interactive evaluation framework built on structured, feedback-driven dialogue. Using a benchmark of 55 programming tasks with expert-annotated hints, they find that the framework provides fine-grained diagnostic insights that static benchmarks miss.
Standard single-turn, static benchmarks fall short in evaluating the nuanced capabilities of Large Language Models (LLMs) on complex tasks such as software engineering. In this work, we propose a novel interactive evaluation framework that assesses LLMs on multi-requirement programming tasks through structured, feedback-driven dialogue. Each task is modeled as a requirement dependency graph, and an ``interviewer'' LLM, aware of the ground-truth solution, provides minimal, targeted hints to an ``interviewee'' model to help correct errors and fulfill target constraints. This dynamic protocol enables fine-grained diagnostic insights into model behavior, uncovering strengths and systematic weaknesses that static benchmarks fail to measure. We build on DevAI, a benchmark of 55 curated programming tasks, by adding ground-truth solutions and evaluating the relevance and utility of interviewer hints through expert annotation. Our results highlight the importance of dynamic evaluation in advancing the development of collaborative code-generating agents.
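The interview protocol described above can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a toy requirement dependency graph and replaces both LLMs with deterministic stubs. All names (`DEPENDENCIES`, `open_requirements`, `run_interview`, and the `toy_*` functions) are hypothetical, and the rule of hinting at the first open requirement is an assumption made only for illustration.

```python
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical toy task: requirement id -> ids of the requirements it depends on.
DEPENDENCIES: Dict[str, Set[str]] = {
    "R0": set(),
    "R1": {"R0"},
    "R2": {"R1"},
}


def open_requirements(deps: Dict[str, Set[str]], satisfied: Set[str]) -> List[str]:
    """Unsatisfied requirements whose prerequisites are all satisfied."""
    return sorted(
        req for req, pre in deps.items()
        if req not in satisfied and pre <= satisfied
    )


def run_interview(
    deps: Dict[str, Set[str]],
    attempt: Callable[[List[str]], Set[str]],  # stand-in for the interviewee
    check: Callable[[Set[str]], Set[str]],     # stand-in for the interviewer's grading
    make_hint: Callable[[str], str],           # stand-in for the interviewer's hinting
    max_turns: int = 5,
) -> Tuple[List[Set[str]], List[str]]:
    """Run a feedback loop; return per-turn satisfied sets and the hints given."""
    hints: List[str] = []
    history: List[Set[str]] = []
    for _ in range(max_turns):
        solution = attempt(hints)
        satisfied = check(solution)
        history.append(satisfied)
        if satisfied >= set(deps):
            break  # every requirement is met
        candidates = open_requirements(deps, satisfied)
        if not candidates:
            break  # nothing actionable to hint at
        # Assumption: one minimal hint per turn, for the first open requirement.
        hints.append(make_hint(candidates[0]))
    return history, hints


# Deterministic stubs: the toy interviewee meets R0 unaided and any other
# requirement only after receiving a hint that names it.
def toy_attempt(hints: List[str]) -> Set[str]:
    return {"R0"} | set(hints)


def toy_check(solution: Set[str]) -> Set[str]:
    return set(solution)


def toy_hint(requirement: str) -> str:
    return requirement


history, hints = run_interview(DEPENDENCIES, toy_attempt, toy_check, toy_hint)
print(hints)         # hints issued, in order
print(len(history))  # number of turns taken
```

In a sketch like this, the number of hints needed and the per-turn sets of satisfied requirements are the kind of signal a single-turn pass/fail score would not expose. In a real setup, the stubs would be replaced by calls to the interviewee and interviewer models.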