Themisto: Jupyter-Based Runtime Benchmark
This work addresses a critical but understudied problem for developers and researchers in AI and software engineering: it provides a benchmark for assessing how well LLMs integrate runtime information. The contribution is incremental, since it focuses on evaluation rather than on new methods.
The authors introduce Themisto, a Jupyter-based benchmark for evaluating how well large language models use runtime information when predicting code output and generating code. They find that current LLMs perform poorly on both tasks, which highlights a gap in leveraging runtime context.
In this work, we present a benchmark that consists of Jupyter notebook development trajectories and allows measuring how well large language models (LLMs) can leverage runtime information for code output prediction and code generation. We demonstrate that the current generation of LLMs performs poorly on these tasks and argue that incorporating runtime context is a significantly understudied area in the development of code models.
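To make "runtime information" concrete, here is a minimal sketch of the kind of setup such a benchmark implies. It is not Themisto's actual data format or evaluation harness. It assumes a notebook trajectory can be approximated as a list of cell sources executed in order in one shared namespace; the helper names `run_cells`, `describe_runtime`, and `build_prompt`, and the prompt layout, are hypothetical.

```python
import contextlib
import io


def run_cells(cells):
    """Execute notebook-like cells in order in one shared namespace.

    Returns the final namespace and the stdout captured from each cell.
    """
    namespace = {}
    outputs = []
    for source in cells:
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            exec(source, namespace)
        outputs.append(buffer.getvalue())
    return namespace, outputs


def describe_runtime(namespace):
    """Summarize user-defined variables as 'name: type = repr' lines."""
    lines = []
    for name, value in sorted(namespace.items()):
        if name.startswith("_"):  # skips __builtins__ and private names
            continue
        lines.append(f"{name}: {type(value).__name__} = {value!r}")
    return "\n".join(lines)


def build_prompt(history, target, runtime_summary=None):
    """Build an output-prediction prompt, optionally with runtime state."""
    parts = ["# Previous cells", *history]
    if runtime_summary is not None:
        parts += ["# Runtime state", runtime_summary]
    parts += ["# Target cell", target, "# Predict the printed output"]
    return "\n".join(parts)


# Hypothetical three-cell trajectory: two history cells and one target cell.
cells = ["xs = [3, 1, 2]", "total = sum(xs)", "print(sorted(xs)[0] + total)"]
history, target = cells[:-1], cells[-1]

# Runtime state as it exists just before the target cell runs.
namespace, _ = run_cells(history)
runtime_summary = describe_runtime(namespace)

# Ground-truth output of the target cell, obtained by running everything.
_, all_outputs = run_cells(cells)
expected_output = all_outputs[-1]

prompt_static = build_prompt(history, target)
prompt_with_runtime = build_prompt(history, target, runtime_summary)
```

Comparing a model's predictions for `prompt_static` and `prompt_with_runtime` against `expected_output` is one simple way to probe whether the added runtime state helps; the benchmark's own protocol may differ.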