Functional Entropy: Predicting Functional Correctness in LLM-Generated Code with Uncertainty Quantification

Dylan Bouchard, Mohit Singh Chauhan, Zeya Ahmad, Ho-Kyeong Ra

arXiv:2605.28500


AI Analysis

For developers and researchers using LLMs for code generation, this work provides effective uncertainty quantification methods to detect functional errors, improving reliability.

The paper evaluates uncertainty quantification methods for detecting functionally incorrect LLM-generated code. It finds that some token-probability-based methods generalize well, while sampling-based methods that rely on natural language inference (NLI) fail. The authors propose functional equivalence methods, including functional entropy, which achieve top AUROC in 11 out of 15 model-benchmark combinations and the best calibration across most settings.

Large language models have shown impressive capabilities in code generation, yet they often produce functionally incorrect code. Uncertainty quantification (UQ) methods have emerged as a promising approach for detecting hallucinations in natural language generation, but their effectiveness for code generation tasks remains underexplored. We systematically evaluate how UQ techniques transfer to code generation across three programming languages, five LLMs, and over 1,700 problems. We find that some token-probability-based methods generalize effectively without modification, while sampling-based methods relying on natural language inference (NLI) fail because NLI models cannot distinguish functionally different code, causing most responses to collapse into a single semantic cluster. To address this, we introduce functional equivalence methods, a family of code-specific methods that replace NLI-based semantic equivalence with an LLM-based functional equivalence assessment, including functional entropy, a code-specific analog of semantic entropy. Functional equivalence methods achieve top AUROC in 11 out of 15 model-benchmark combinations and the best calibration across most settings, consistently outperforming both NLI-based counterparts and all other methods evaluated.
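The abstract describes functional entropy only at a high level, so the following is a minimal sketch of one plausible reading, not the authors' implementation. It assumes the discrete form of semantic entropy: sampled programs are greedily grouped into clusters of functionally equivalent code, and the entropy is taken over the cluster frequencies. The names `cluster_by_equivalence`, `functional_entropy`, and `same_outputs_on_probes` are hypothetical. In particular, `same_outputs_on_probes` is a toy stand-in that runs each snippet on a few probe inputs; the paper uses an LLM-based equivalence assessment in that role.

```python
import math
from typing import Callable, List, Sequence


def cluster_by_equivalence(
    samples: Sequence[str],
    are_equivalent: Callable[[str, str], bool],
) -> List[List[int]]:
    """Greedily group sample indices; each sample joins the first cluster
    whose representative (first member) is judged equivalent to it."""
    clusters: List[List[int]] = []
    for i, sample in enumerate(samples):
        for cluster in clusters:
            if are_equivalent(samples[cluster[0]], sample):
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def functional_entropy(
    samples: Sequence[str],
    are_equivalent: Callable[[str, str], bool],
) -> float:
    """Shannon entropy (in nats) of the empirical cluster distribution.
    Assumption: this is the discrete variant; the paper may weight
    clusters differently, e.g. by sequence likelihood."""
    if not samples:
        raise ValueError("at least one sample is required")
    n = len(samples)
    clusters = cluster_by_equivalence(samples, are_equivalent)
    return -sum((len(c) / n) * math.log(len(c) / n) for c in clusters)


def same_outputs_on_probes(code_a: str, code_b: str) -> bool:
    """Toy judge (hypothetical): both snippets must define f(x); they count
    as equivalent if they agree on a handful of probe inputs. Only run this
    on trusted code, since it calls exec()."""
    def outputs(code: str) -> list:
        namespace: dict = {}
        exec(code, namespace)
        return [namespace["f"](x) for x in (0, 1, 2, 3)]
    return outputs(code_a) == outputs(code_b)


if __name__ == "__main__":
    samples = [
        "def f(x): return x + x",
        "def f(x): return 2 * x",
        "def f(x): return x * x",
    ]
    print(cluster_by_equivalence(samples, same_outputs_on_probes))
    print(functional_entropy(samples, same_outputs_on_probes))
```

Under this reading, a low value means the sampled programs mostly agree in behavior, and a high value means the model produced several functionally distinct answers, which is the signal used to flag likely incorrect code. An NLI-based judge that places every sample in one cluster would always yield zero entropy, which matches the failure mode the abstract reports.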
