Measuring LLM Code Generation Stability via Structural Entropy
This work addresses the need for a reliable way to evaluate the stability of LLM code generation in real-world development. The contribution is incremental: it extends prior structural-entropy concepts to the program domain.
The paper tackles the problem of assessing the stability of code generated by large language models. It develops reference-free, language-agnostic metrics based on structural entropy and abstract syntax tree analysis, and shows that these metrics reveal nuances in model consistency and robustness across standard tasks.
Assessing the stability of code generation from large language models (LLMs) is essential for judging their reliability in real-world development. We extend prior structural-entropy concepts to the program domain by pairing entropy with abstract syntax tree (AST) analysis. For any fixed prompt, we collect the multiset of depth-bounded AST subtrees in each generated program and treat their relative frequencies as a probability distribution. We then measure stability in two complementary ways: (i) Jensen-Shannon divergence, a symmetric, bounded indicator of structural overlap, and (ii) a Structural Cross-Entropy ratio that highlights missing high-probability patterns. Both metrics admit structural-only and token-aware variants, giving separate views of control-flow shape and identifier-level variability. Unlike pass@k, BLEU, or CodeBLEU, our metrics are reference-free, language-agnostic, and execution-independent. We benchmark several leading LLMs on standard code generation tasks and show that AST-driven structural entropy reveals nuances in model consistency and robustness. The method runs in O(n,d) time and requires no external tests, making it a lightweight addition to the code-generation evaluation toolkit.
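To make the pipeline concrete, the following is a minimal sketch, not the authors' implementation, of how depth-bounded subtree distributions and their Jensen-Shannon divergence could be computed for two generated Python programs. It relies on Python's standard `ast` module, which makes this particular sketch Python-only even though the metrics are described as language-agnostic. The function names (`shape`, `subtree_distribution`, `js_divergence`), the string serialization of subtrees, the default depth of 2, and the way the token-aware variant appends identifier and constant values to node labels are all assumptions made for illustration. The Structural Cross-Entropy ratio is left out because its exact form is not given here.

```python
import ast
import math
from collections import Counter


def shape(node, depth, token_aware=False):
    """Serialize the subtree rooted at `node`, truncated `depth` levels below it."""
    label = type(node).__name__
    if token_aware:
        # Assumed token-aware variant: attach identifier / constant values.
        if isinstance(node, ast.Name):
            label += ":" + node.id
        elif isinstance(node, ast.Constant):
            label += ":" + repr(node.value)
    if depth == 0:
        return label
    children = [shape(c, depth - 1, token_aware) for c in ast.iter_child_nodes(node)]
    return label + "(" + ",".join(children) + ")" if children else label


def subtree_distribution(source, depth=2, token_aware=False):
    """Relative frequencies of depth-bounded subtrees, one rooted at every AST node."""
    tree = ast.parse(source)
    counts = Counter(shape(n, depth, token_aware) for n in ast.walk(tree))
    total = sum(counts.values())
    return {pattern: c / total for pattern, c in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: symmetric and bounded in [0, 1]."""
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl(a):
        # KL(a || m); m[k] > 0 wherever a[k] > 0, so the log is well defined.
        return sum(a[k] * math.log2(a[k] / m[k]) for k in a if a[k] > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)


# Two hypothetical generations that differ only in an identifier name.
sample_a = "def f(x):\n    return x + 1\n"
sample_b = "def f(y):\n    return y + 1\n"

structural = js_divergence(subtree_distribution(sample_a),
                           subtree_distribution(sample_b))
token_aware = js_divergence(subtree_distribution(sample_a, token_aware=True),
                            subtree_distribution(sample_b, token_aware=True))
print(f"structural-only JSD: {structural:.3f}")
print(f"token-aware JSD:     {token_aware:.3f}")
```

In this sketch the two samples share the same tree shape, so the structural-only divergence is zero, while the token-aware variant registers the renamed identifier. This mirrors the separation between control-flow shape and identifier-level variability described above.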