Gistify! Codebase-Level Understanding via Runtime Execution
The work addresses the need for automated, challenging evaluation of coding agents in large codebases; the contribution is incremental in that it builds on existing evaluation methods.
The paper tackles the problem of evaluating coding LLMs on large codebases by proposing Gistify, a task where models must create minimal self-contained files to replicate specific functionalities, and finds that current state-of-the-art models struggle, especially with long execution traces.
As coding agents are increasingly deployed in large codebases, the need to automatically design challenging, codebase-level evaluation is central. We propose Gistify, a task where a coding LLM must create a single, minimal, self-contained file that can reproduce a specific functionality of a codebase. The coding LLM is given full access to a codebase along with a specific entrypoint (e.g., a Python command), and the generated file must replicate the output of the same command run under the full codebase, while containing only the essential components necessary to execute the provided command. Success on Gistify requires structural understanding of the codebase, accurate modeling of its execution flow, and the ability to produce potentially large code patches. Our findings show that current state-of-the-art models struggle to reliably solve Gistify tasks, especially ones with long execution traces.
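The output-equivalence requirement described above can be illustrated with a minimal sketch. This is not the paper's evaluation harness: the function names (`run_command`, `outputs_match`, `demo`), the toy file names (`main.py`, `helpers.py`, `gist.py`), and the choice to compare only exit code and standard output are all assumptions made for illustration. The sketch also does not check minimality or self-containment, which the task additionally requires.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_command(args, cwd):
    """Run a command in `cwd` and return (exit code, stdout)."""
    result = subprocess.run(
        args, cwd=cwd, capture_output=True, text=True, timeout=60
    )
    return result.returncode, result.stdout


def outputs_match(original_cmd, original_dir, gist_cmd, gist_dir):
    """Hypothetical check: does the single-file gist reproduce the
    entrypoint's observable behaviour (exit code and stdout)?"""
    return run_command(original_cmd, original_dir) == run_command(gist_cmd, gist_dir)


def demo():
    """Build a toy two-file 'codebase' and a one-file gist, then compare."""
    with tempfile.TemporaryDirectory() as tmp:
        repo = Path(tmp) / "repo"
        gist = Path(tmp) / "gist"
        repo.mkdir()
        gist.mkdir()

        # Toy codebase: the entrypoint uses one helper; `unused` is dead code.
        (repo / "helpers.py").write_text(
            "def greet(name):\n"
            "    return f'hello {name}'\n"
            "\n"
            "def unused():\n"
            "    return 0\n"
        )
        (repo / "main.py").write_text(
            "from helpers import greet\n"
            "print(greet('world'))\n"
        )

        # Gist: only the code the entrypoint actually executes, inlined.
        (gist / "gist.py").write_text(
            "def greet(name):\n"
            "    return f'hello {name}'\n"
            "\n"
            "print(greet('world'))\n"
        )

        return outputs_match(
            [sys.executable, "main.py"], repo,
            [sys.executable, "gist.py"], gist,
        )


if __name__ == "__main__":
    print(demo())
```

In this toy case the gist drops the `unused` helper and inlines `greet`, so both commands produce the same output while the gist contains only what the entrypoint needs.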