What I cannot execute, I do not understand: Training and Evaluating LLMs on Program Execution Traces
This work aims to improve LLMs' code understanding. It is relevant to the machine learning and programming communities, particularly those working on code generation and code understanding tasks.
The authors train and evaluate large language models (LLMs) on program execution traces to improve their code understanding, reaching around 80% accuracy on output prediction on CruxEval and MBPP. Their approach models real-world program execution traces without requiring manual test annotations.
Code generation and understanding are critical capabilities for large language models (LLMs). Thus, most LLMs are pretrained and fine-tuned on code data. However, these datasets typically treat code as static strings and rarely exploit the dynamic information about its execution. Building upon previous work on trace modeling, we study Execution Tuning (E.T.), a training procedure in which we explicitly model real-world program execution traces without requiring manual test annotations. We train and evaluate models on different execution trace granularities (line- and instruction-level) and strategies on the task of output prediction, obtaining around 80% accuracy on CruxEval and MBPP, and showing the advantages of dynamic scratchpads (i.e., self-contained intermediate computations updated by the model rather than accumulated as a history of past computations) on long executions (up to 14k steps). Finally, we discuss E.T.'s practical applications.
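To make the notions of a line-level trace and a dynamic scratchpad concrete, the following is a minimal sketch in Python. It is not the authors' tracing pipeline or trace format: the helper `trace_lines`, the toy function `sum_to`, and the dictionary layout of the scratchpad are all hypothetical names chosen for illustration. The sketch uses the standard `sys.settrace` hook to record the local variables before each executed line, then contrasts a full history (one entry per step) with a scratchpad that holds only the current state.

```python
import sys


def trace_lines(func, *args):
    """Run func(*args) and record one (line offset, locals snapshot) pair per line event.

    Hypothetical helper for illustration; each snapshot is the state just
    before the corresponding line executes.
    """
    steps = []
    code = func.__code__

    def tracer(frame, event, arg):
        if frame.f_code is not code:
            return None  # do not trace into other Python functions
        if event == "line":
            offset = frame.f_lineno - code.co_firstlineno
            steps.append((offset, dict(frame.f_locals)))
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(previous)  # restore whatever tracer was active before
    return result, steps


def sum_to(n):
    total = 0
    for i in range(1, n + 1):
        total += i
    return total


# Full history: grows by one entry per executed line.
result, history = trace_lines(sum_to, 3)

# Scratchpad-style view (assumed layout): a single self-contained state that
# is overwritten at every step instead of being appended to a history.
scratchpad = None
for offset, state in history:
    scratchpad = {"line": offset, "locals": state}

print("output:", result)
print("history length:", len(history))
print("final scratchpad:", scratchpad)
```

The history grows linearly with the number of executed steps, while the scratchpad stays the size of a single state. This size difference is one intuition for why a self-contained state may be preferable on long executions, under the assumptions of this toy setup.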