An Empirical Study of Sustainability in Prompt-driven Test Script Generation Using Small Language Models
This work addresses sustainability concerns for software developers and researchers who use small language models (SLMs) in automated testing, but its contribution is incremental: it extends existing sustainability analyses from large to small language models.
This study examines the environmental impact of small language models (SLMs) in prompt-driven unit-test script generation and finds that different SLMs (2B-8B parameters) exhibit distinct sustainability profiles: some favor lower energy use and faster execution, while others maintain higher stability or coverage under comparable conditions.
The increasing use of language models in automated test script generation raises concerns about their environmental impact, yet existing sustainability analyses focus predominantly on large language models. As a result, the energy and carbon characteristics of small language models (SLMs) during prompt-driven unit-test script generation remain largely unexplored. To address this gap, this study empirically examines the environmental and performance tradeoffs of SLMs in the 2B-8B parameter range, using the HumanEval benchmark and adaptive prompt variants based on the Anthropic template. The analysis uses CodeCarbon to characterize energy consumption, carbon emissions, and duration under controlled conditions, with the coverage achieved by the generated unit-test scripts serving as an initial proxy for test quality. Our results show that different SLMs exhibit distinct sustainability profiles: some favor lower energy use and faster execution, while others maintain higher stability or coverage under comparable conditions. Overall, this work provides focused empirical evidence on sustainable SLM-based test script generation, clarifying how prompt structure and model selection jointly shape environmental and performance outcomes.
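As an illustration of the kind of measurement described above, the following is a minimal sketch of wrapping one generation run with timing and, when the library is installed, a CodeCarbon `EmissionsTracker`. It is not the study's actual harness: the names `RunMetrics` and `measure`, the `use_codecarbon` flag, and the stand-in task are hypothetical, and the sketch falls back to duration only when CodeCarbon is unavailable.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RunMetrics:
    """Hypothetical container for the metrics of a single run."""
    duration_s: float
    emissions_kg: Optional[float]  # kg CO2-eq; None when no tracker was used


def measure(task: Callable[[], object], use_codecarbon: bool = True) -> RunMetrics:
    """Run `task` once, recording wall-clock time and, if possible, emissions.

    `task` stands in for one prompt-driven test-generation call (an assumption;
    the source does not describe its harness at this level of detail).
    """
    tracker = None
    if use_codecarbon:
        try:
            # Imported lazily so the sketch still runs without CodeCarbon.
            from codecarbon import EmissionsTracker
            tracker = EmissionsTracker(save_to_file=False, log_level="error")
        except ImportError:
            tracker = None

    start = time.perf_counter()
    if tracker is not None:
        tracker.start()
    try:
        task()
    finally:
        # stop() returns the estimated emissions for the tracked interval.
        emissions = tracker.stop() if tracker is not None else None
        duration = time.perf_counter() - start
    return RunMetrics(duration_s=duration, emissions_kg=emissions)


if __name__ == "__main__":
    # Stand-in workload; a real run would call the model here instead.
    metrics = measure(lambda: sum(range(100_000)), use_codecarbon=False)
    print(f"duration: {metrics.duration_s:.4f} s, emissions: {metrics.emissions_kg}")
```

Keeping the tracker optional lets the same wrapper be used for quick timing checks and for full energy and emissions runs; comparing models fairly would additionally require fixed hardware and repeated runs, as the "controlled conditions" wording suggests.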