SE LG Jan 22, 2025

Correctness Assessment of Code Generated by Large Language Models Using Internal Representations

Tuan-Dung Bui, Thanh Trong Vu, Thu-Trang Nguyen, Son Nguyen, Hieu Dinh Vo

arXiv:2501.12934v219.125 citations h-index: 9 Has Code J Syst Softw

Originality: Incremental advance

AI Analysis

This work addresses the challenge of ensuring code correctness in AI-driven software development and offers a more proactive quality assurance mechanism. The advance is incremental, as it builds on existing LLM methods.

The paper tackles the problem of assessing the correctness of code generated by Large Language Models (LLMs). It introduces OPENIA, a white-box framework that uses the model's internal representations during code generation to predict correctness, and it reports up to a 2X improvement over baselines in standalone code generation and a 46% improvement in repository-specific scenarios.

Ensuring the correctness of code generated by Large Language Models (LLMs) presents a significant challenge in AI-driven software development. Existing approaches predominantly rely on black-box (closed-box) techniques that evaluate correctness post-generation, failing to utilize the rich insights embedded in the LLMs' internal states during code generation. In this paper, we introduce OPENIA, a novel white-box (open-box) framework that leverages these internal representations to assess the correctness of LLM-generated code. OPENIA systematically analyzes the intermediate states of representative open-source LLMs specialized for code, including DeepSeek-Coder, CodeLlama, and MagicCoder, across diverse code generation benchmarks. Our empirical analysis reveals that these internal representations encode latent information that strongly correlates with the correctness of the generated code. Building on these insights, OPENIA uses a white-box/open-box approach to make informed predictions about code correctness, offering significant advantages in adaptability and robustness over traditional classification-based methods and zero-shot approaches. Experimental results demonstrate that OPENIA consistently outperforms baseline models, achieving higher accuracy, precision, recall, and F1-scores, with up to a 2X improvement in standalone code generation and a 46% enhancement in repository-specific scenarios. By unlocking the potential of in-process signals, OPENIA paves the way for more proactive and efficient quality assurance mechanisms in LLM-assisted code generation.
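The abstract does not describe how OPENIA turns internal representations into a correctness prediction, so the following is only a minimal sketch of the general idea: train a small probing classifier on per-sample hidden-state vectors labeled with whether the generated code was correct, then score it with the metrics the abstract names. Everything here is an assumption for illustration. The vectors are synthetic stand-ins for hidden states extracted from a code LLM, the probe is plain logistic regression, and the names `make_synthetic_states`, `train_probe`, `predict`, and `scores` are hypothetical rather than part of OPENIA.

```python
import math
import random


def make_synthetic_states(n, dim, seed=0):
    """Hypothetical stand-in for hidden states: one vector per generated sample.

    Label 1 means "generated code was correct". The first 4 dimensions are
    shifted by the label so that the vectors carry a learnable signal.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        shift = 1.0 if label == 1 else -1.0
        vec = [rng.gauss(0.0, 1.0) + (shift if j < 4 else 0.0) for j in range(dim)]
        data.append((vec, label))
    return data


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_probe(data, epochs=30, lr=0.05):
    """Fit a logistic-regression probe with plain stochastic gradient descent."""
    dim = len(data[0][0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for x, y in data:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            grad = p - y  # derivative of the log loss w.r.t. the logit
            w = [wi - lr * grad * xi for wi, xi in zip(w, x)]
            b -= lr * grad
    return w, b


def predict(w, b, x):
    """Return 1 (predicted correct) or 0 (predicted incorrect)."""
    return 1 if sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) >= 0.5 else 0


def scores(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for the positive ("correct") class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(y_true)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    train = make_synthetic_states(200, 16, seed=0)
    test = make_synthetic_states(100, 16, seed=1)
    w, b = train_probe(train)
    y_true = [y for _, y in test]
    y_pred = [predict(w, b, x) for x, _ in test]
    print(scores(y_true, y_pred))
```

In a real setting, the synthetic vectors would be replaced by representations captured from the model while it generates each sample, and the labels would come from running the benchmark's tests on the generated code.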
