Ensuring Functional Correctness of Large Code Models with Selective Generation
This work addresses code generation for safety-critical systems that demand high reliability. It contributes a novel method for a known bottleneck rather than a fundamental paradigm shift.
The paper tackles code hallucination in large language models with a selective generation approach: the model abstains from uncertain outputs, with functional correctness evaluated through automatically generated unit tests. The approach provides theoretical control over the false discovery rate and demonstrates reasonable selection efficiency.
The hallucination of code generation models hinders their applicability to systems requiring higher safety standards. One critical bottleneck in addressing code hallucination is the difficulty of identifying the functional correctness of generated code, due to its unnatural form. We address this core bottleneck by automatically generating unit tests using dynamic code analysis tools, leveraging the \emph{executable nature} of code. Accordingly, we propose a \emph{selective code generator} that abstains from uncertain generations, based on the functional correctness evaluated by generated unit tests, to theoretically control the rate of incorrect code among non-abstained answers, i.e., the false discovery rate. Finally, we propose to use generated unit tests in evaluation as well as in learning for precise code evaluation, calling this paradigm \emph{FuzzEval}. We demonstrate the efficacy of our method along with the controllability of code hallucination and reasonable selection efficiency.
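To make the idea of abstaining under a false discovery rate (FDR) target concrete, the following is a minimal sketch and not the paper's algorithm. It assumes a calibration set of (confidence score, passed generated unit tests) pairs and picks a score threshold using a simple Hoeffding-style bound with a union bound over candidate thresholds. The names `fdr_upper_bound`, `choose_threshold`, and `selective_generate`, and the parameters `epsilon` (target FDR) and `delta` (failure probability), are hypothetical and chosen for illustration only.

```python
import math


def fdr_upper_bound(num_errors, num_selected, delta):
    """Hoeffding-style upper bound on the error rate among selected samples.

    This bound is an assumption of the sketch, not the paper's bound.
    """
    if num_selected == 0:
        return 1.0  # nothing selected: no evidence, so be maximally pessimistic
    empirical = num_errors / num_selected
    slack = math.sqrt(math.log(1.0 / delta) / (2.0 * num_selected))
    return min(1.0, empirical + slack)


def choose_threshold(calibration, epsilon, delta):
    """Pick the lowest score threshold whose FDR bound is at most epsilon.

    calibration: list of (confidence_score, passed_tests) pairs, where
    passed_tests says whether the generated code passed its unit tests.
    Returns None if no threshold satisfies the target.
    """
    candidates = sorted({score for score, _ in calibration})
    if not candidates:
        return None
    per_candidate_delta = delta / len(candidates)  # union bound over thresholds
    for tau in candidates:  # ascending, so the first hit keeps the most answers
        selected = [ok for score, ok in calibration if score >= tau]
        errors = sum(1 for ok in selected if not ok)
        if fdr_upper_bound(errors, len(selected), per_candidate_delta) <= epsilon:
            return tau
    return None


def selective_generate(code, score, tau):
    """Return the generated code if it clears the threshold, else abstain (None)."""
    if tau is None or score < tau:
        return None
    return code


if __name__ == "__main__":
    # Synthetic calibration data for illustration only.
    calibration = [(0.9, True)] * 200 + [(0.1, False)] * 200
    tau = choose_threshold(calibration, epsilon=0.2, delta=0.1)
    print("threshold:", tau)
    print("confident:", selective_generate("def f(): ...", 0.95, tau))
    print("uncertain:", selective_generate("def g(): ...", 0.50, tau))
```

Scanning thresholds in ascending order reflects the selection-efficiency goal: among thresholds that meet the FDR target, the lowest one abstains least often.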