Test-driven Software Experimentation with LASSO: an LLM Prompt Benchmarking Example
The work addresses a critical gap for software engineering researchers by providing a reusable and extensible tool for rapid experimentation, though the contribution is incremental because it builds on existing test-driven concepts.
The paper tackles the lack of standardized tools for Test-Driven Software Experiments (TDSEs) in empirical software engineering by introducing LASSO, a general-purpose analysis platform for efficiently evaluating run-time semantics and execution characteristics. An example study that assesses the reliability of LLMs for code generation demonstrates the approach.
Empirical software engineering faces a critical gap: the lack of standardized tools for rapid development and execution of Test-Driven Software Experiments (TDSEs) -- that is, experiments that involve the execution of software subjects and the observation and analysis of their "de facto" run-time behavior. In this paper we present a general-purpose analysis platform called LASSO that provides a minimal set of domain-specific languages and data structures to conduct TDSEs. By empowering users with an executable scripting language to design and execute TDSEs, LASSO enables efficient evaluation of run-time semantics and execution characteristics in addition to statically determined properties. We present an example TDSE that demonstrates the practical benefits of LASSO's scripting capabilities for assessing the reliability of LLMs for code generation by means of a self-contained, reusable and extensible study script. The LASSO platform and live pipeline examples are publicly available at: https://softwareobservatorium.github.io/.
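To make the notion of a TDSE concrete, the following is a minimal Python sketch of the general idea: several candidate implementations are executed against the same tests, and their observed run-time behavior is recorded and summarized. It does not use LASSO's scripting language or data structures. All names here (`TESTS`, `CANDIDATES`, `observe`, `pass_rate`) are hypothetical, and the hard-coded candidate sources merely stand in for LLM-generated code.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical test cases: (input arguments, expected output) pairs.
TESTS: List[Tuple[Tuple[int, int], int]] = [((2, 3), 5), ((0, 0), 0), ((-1, 1), 0)]

# Hypothetical candidates standing in for LLM-generated code (an assumption
# for illustration; no model is actually queried here).
CANDIDATES: Dict[str, str] = {
    "candidate_a": "def add(a, b):\n    return a + b\n",
    "candidate_b": "def add(a, b):\n    return a - b\n",
    "candidate_c": "def add(a, b):\n    return a + c\n",  # fails at run time
}


def observe(source: str, entry_point: str, tests) -> List[Dict[str, Any]]:
    """Execute one candidate on every test and record what actually happened."""
    namespace: Dict[str, Any] = {}
    # NOTE: exec() on untrusted code is unsafe; a real setup would sandbox it.
    exec(source, namespace)
    func = namespace[entry_point]
    records = []
    for args, expected in tests:
        try:
            observed = func(*args)
        except Exception as exc:
            # Record the exception type as the observed behavior.
            observed = type(exc).__name__
        records.append(
            {
                "args": args,
                "expected": expected,
                "observed": observed,
                "passed": observed == expected,
            }
        )
    return records


def pass_rate(records: List[Dict[str, Any]]) -> float:
    """Fraction of tests on which observed behavior matched the expectation."""
    return sum(r["passed"] for r in records) / len(records)


# Run every candidate on every test, then summarize per candidate.
results = {name: observe(src, "add", TESTS) for name, src in CANDIDATES.items()}
rates = {name: pass_rate(recs) for name, recs in results.items()}

for name, rate in rates.items():
    print(f"{name}: pass rate {rate:.2f}")
```

The sketch keeps the raw observations (`results`) separate from the derived metric (`rates`), so the recorded behavior can be re-analyzed later without re-executing the candidates.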