SE CL Apr 28, 2025

Large Language Models are Qualified Benchmark Builders: Rebuilding Pre-Training Datasets for Advancing Code Intelligence Tasks

Kang Yang, Xinjun Mao, Shangwen Wang, Yanlin Wang, Tanghaoran Zhang, Bo Lin, Yihao Qin, Zhang Zhang, Yao Lu, Kamal Al-Sabahi

arXiv:2504.19444v18.05 citations h-index: 9 ICPC

Originality: Incremental advance

AI Analysis

This work addresses the issue of data quality for researchers and practitioners in code intelligence, offering an incremental improvement by leveraging LLMs to enhance pre-training datasets.

The paper addresses outdated human-written comments in pre-training datasets for code models, which degrade model performance, by replacing them with LLM-generated comments. Models trained on the rebuilt data outperform those trained on the original comments in code summarization, generation, and translation.

Pre-trained code models rely heavily on high-quality pre-training data, particularly human-written reference comments that bridge code and natural language. However, these comments often become outdated as software evolves, degrading model performance. Large language models (LLMs) excel at generating high-quality code comments. We investigate whether replacing human-written comments with LLM-generated ones improves pre-training datasets. Since standard metrics cannot assess reference comment quality, we propose two novel reference-free evaluation tasks: code-comment inconsistency detection and semantic code search. Results show that LLM-generated comments are more semantically consistent with code than human-written ones, as confirmed by manual evaluation. Leveraging this finding, we rebuild the CodeSearchNet dataset with LLM-generated comments and re-pre-train CodeT5. Evaluations demonstrate that models trained on LLM-enhanced data outperform those using original human comments in code summarization, generation, and translation tasks. This work validates rebuilding pre-training datasets with LLMs to advance code intelligence, challenging the traditional reliance on human reference comments.
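The semantic code search task mentioned in the abstract can be illustrated with a small sketch: each comment is used as a query, and the comment set scores well if its paired code snippet ranks near the top. The version below is only an illustration under stated assumptions. It uses plain token-overlap cosine similarity as a stand-in for whatever retrieval model the paper actually uses, and the function names and toy data are hypothetical rather than taken from the paper.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Split camelCase, then keep alphabetic runs (this also splits snake_case).
    text = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", text)
    return [t.lower() for t in re.findall(r"[A-Za-z]+", text)]


def cosine(a, b):
    # Cosine similarity between two bag-of-words Counters.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def mean_reciprocal_rank(comments, codes):
    # comments[i] is the query whose correct answer is codes[i].
    # Assumption: lexical overlap stands in for a learned retriever.
    code_vecs = [Counter(tokenize(c)) for c in codes]
    total = 0.0
    for i, comment in enumerate(comments):
        query = Counter(tokenize(comment))
        scores = [cosine(query, v) for v in code_vecs]
        # Pessimistic rank: ties with the true snippet count against it.
        rank = sum(1 for s in scores if s >= scores[i])
        total += 1.0 / rank
    return total / len(comments)


# Hypothetical toy data: one stale comment versus regenerated comments.
codes = [
    "def read_file(path):\n    return open(path).read()",
    "def add_numbers(a, b):\n    return a + b",
]
stale_comments = ["Write data to the socket", "Add two numbers"]
fresh_comments = ["Read the file at the given path", "Add two numbers a and b"]

stale_mrr = mean_reciprocal_rank(stale_comments, codes)
fresh_mrr = mean_reciprocal_rank(fresh_comments, codes)
print(f"stale MRR: {stale_mrr:.2f}, fresh MRR: {fresh_mrr:.2f}")
```

A comment that no longer describes its code retrieves the wrong snippet (or ties with it), which lowers the mean reciprocal rank. This is why the metric needs no reference comment: the code itself serves as the ground truth.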
