CL · Dec 22, 2025

CodeSimpleQA: Scaling Factuality in Code Large Language Models

arXiv:2512.19424v1 · 2 citations · h-index: 23
Originality: Incremental advance
AI Analysis

The paper addresses the challenge of reliable code generation for developers and AI practitioners, though the advance is incremental because it builds on existing alignment methods.

The paper tackles the problem of ensuring factual accuracy in code large language models (LLMs) by introducing CodeSimpleQA, a bilingual benchmark for evaluating code factuality, and shows that its post-training framework improves performance over the base model.

Large language models (LLMs) have made significant strides in code generation, achieving impressive capabilities in synthesizing code snippets from natural language instructions. However, a critical challenge remains in ensuring LLMs generate factually accurate responses about programming concepts, technical implementations, etc. Most previous code-related benchmarks focus on code execution correctness, overlooking the factual accuracy of programming knowledge. To address this gap, we present CodeSimpleQA, a comprehensive bilingual benchmark designed to evaluate the factual accuracy of code LLMs in answering code-related questions, which contains carefully curated question-answer pairs in both English and Chinese, covering diverse programming languages and major computer science domains. Further, we create CodeSimpleQA-Instruct, a large-scale instruction corpus with 66M samples, and develop a post-training framework combining supervised fine-tuning and reinforcement learning. Our comprehensive evaluation of diverse LLMs reveals that even frontier LLMs struggle with code factuality. Our proposed framework demonstrates substantial improvements over the base model, underscoring the critical importance of factuality-aware alignment in developing reliable code LLMs.

