Hallucinations in Code Change to Natural Language Generation: Prevalence and Evaluation of Detection Metrics
This work addresses hallucinations in software engineering tasks, a concern for both developers and researchers. It provides the first comprehensive analysis in this domain, though the contribution is incremental in that it builds on existing hallucination studies.
The paper tackles the problem of hallucinations in language models for code change to natural language generation tasks, such as commit message and code review comment generation, finding that approximately 50% of generated code reviews and 20% of generated commit messages contain hallucinations. It explores detection metrics, showing that combining multiple metrics, including model confidence and feature attribution, substantially improves performance for inference-time detection.
Language models have shown strong capabilities across a wide range of software engineering tasks, such as code generation, yet they suffer from hallucinations. While hallucinations have been studied independently in natural language generation and in code generation, their occurrence in tasks involving code changes, which have a structurally complex and context-dependent format, remains largely unexplored. This paper presents the first comprehensive analysis of hallucinations in two critical tasks involving code change to natural language generation: commit message generation and code review comment generation. We quantify the prevalence of hallucinations in recent language models and explore a range of metric-based approaches to automatically detect them. Our findings reveal that approximately 50\% of generated code reviews and 20\% of generated commit messages contain hallucinations. Whilst commonly used metrics are weak detectors on their own, combining multiple metrics substantially improves performance. Notably, model confidence and feature attribution metrics effectively contribute to hallucination detection, showing promise for inference-time detection.\footnote{All code and data will be released upon acceptance.}
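To make the idea of combining several weak metrics into one detector concrete, the following is a minimal sketch rather than the paper's method: the abstract does not say how the metrics are combined, so logistic regression is assumed here purely for illustration. The three feature names (a lexical overlap score, a model confidence score, and a feature attribution score), the toy values, and the helper functions `fit_logistic` and `predict_proba` are all hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(w, b, x):
    """Probability that a sample with metric scores x is hallucinated."""
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))

def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Plain batch gradient descent on the logistic loss (assumed combiner)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(w, b, xi) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

# Synthetic toy scores, NOT data from the paper.
# Columns: [lexical_overlap, model_confidence, attribution_to_input]
X = [
    [0.2, 0.3, 0.2], [0.6, 0.4, 0.3], [0.3, 0.5, 0.1], [0.5, 0.2, 0.4],
    [0.7, 0.9, 0.8], [0.4, 0.8, 0.7], [0.8, 0.7, 0.9], [0.5, 0.9, 0.6],
]
y = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = hallucinated, 0 = faithful

w, b = fit_logistic(X, y)

# Score two unseen generations at inference time.
p_suspect = predict_proba(w, b, [0.3, 0.3, 0.2])
p_faithful = predict_proba(w, b, [0.6, 0.9, 0.8])
print(f"suspect: {p_suspect:.2f}, faithful: {p_faithful:.2f}")
```

In this toy setup the combiner learns one weight per metric, so a metric that is a weak detector on its own can still contribute to the combined score. In practice the scores would come from real metric implementations and the detector would be evaluated on held-out labeled generations.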