SE AI · May 31, 2025

CoQuIR: A Comprehensive Benchmark for Code Quality-Aware Information Retrieval

Jiahui Geng, Fengyu Cai, Shaobo Cui, Qing Li, Liangwei Chen, Chenyang Lyu, Haonan Li, Derui Zhu, Walter Pretschner, Heinz Koeppl, Fakhri Karray

arXiv:2506.11066v29.83 citations · h-index: 5 · Has Code

Originality: Incremental advance

AI Analysis

This work addresses the need for more trustworthy software development tools by providing a comprehensive benchmark for quality-aware code retrieval. The advance is incremental, since it extends existing retrieval evaluation frameworks.

Current code retrieval benchmarks focus on functional relevance and neglect software quality. To address this, the authors introduce CoQuIR, a large-scale multilingual benchmark with quality annotations for 42,725 queries and 134,907 code snippets across 11 languages. They find that even top-performing models often fail to distinguish buggy or insecure code from more robust alternatives.

Code retrieval is essential in modern software development, as it boosts code reuse and accelerates debugging. However, current benchmarks primarily emphasize functional relevance while neglecting critical dimensions of software quality. Motivated by this gap, we introduce CoQuIR, the first large-scale, multilingual benchmark specifically designed to evaluate quality-aware code retrieval across four key dimensions: correctness, efficiency, security, and maintainability. CoQuIR provides fine-grained quality annotations for 42,725 queries and 134,907 code snippets in 11 programming languages, and is accompanied by two quality-centric evaluation metrics: Pairwise Preference Accuracy and Margin-based Ranking Score. Using CoQuIR, we benchmark 23 retrieval models, covering both open-source and proprietary systems, and find that even top-performing models frequently fail to distinguish buggy or insecure code from its more robust counterparts. Furthermore, we conduct preliminary investigations into training methods that explicitly encourage retrievers to recognize code quality. Using synthetic datasets, we demonstrate promising improvements in quality-aware metrics across various models, without sacrificing semantic relevance. Downstream code generation experiments further validate the effectiveness of our approach. Overall, our work highlights the importance of integrating quality signals into code retrieval systems, laying the groundwork for more trustworthy and robust software development tools.
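The abstract names Pairwise Preference Accuracy and Margin-based Ranking Score without defining them here. The sketch below is only an illustration of the general idea of pairwise, quality-aware evaluation. It assumes each test item is a hypothetical (query, higher-quality snippet, lower-quality snippet) triple and that the retriever exposes a `score(query, code)` function. The function names, the data layout, and in particular the `mean_score_margin` formula are assumptions for illustration, not the paper's definitions.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical layout: (query, higher_quality_snippet, lower_quality_snippet).
Triple = Tuple[str, str, str]
ScoreFn = Callable[[str, str], float]


def pairwise_preference_accuracy(triples: Sequence[Triple], score: ScoreFn) -> float:
    """Fraction of triples where the higher-quality snippet gets the higher score.

    Assumed reading of a pairwise preference metric; ties count as failures.
    """
    if not triples:
        raise ValueError("need at least one triple")
    wins = sum(1 for q, good, bad in triples if score(q, good) > score(q, bad))
    return wins / len(triples)


def mean_score_margin(triples: Sequence[Triple], score: ScoreFn) -> float:
    """Average of score(good) - score(bad) over all triples.

    A stand-in for a margin-style metric; NOT the paper's
    Margin-based Ranking Score, whose definition is not given here.
    """
    if not triples:
        raise ValueError("need at least one triple")
    total = sum(score(q, good) - score(q, bad) for q, good, bad in triples)
    return total / len(triples)


def token_overlap_score(query: str, code: str) -> float:
    """Toy, quality-blind relevance score: Jaccard overlap of whitespace tokens."""
    q_tokens = set(query.lower().split())
    c_tokens = set(code.lower().split())
    union = q_tokens | c_tokens
    return len(q_tokens & c_tokens) / len(union) if union else 0.0
```

A retriever that ranks purely by functional relevance, like the toy `token_overlap_score`, has no reason to prefer the higher-quality snippet in a pair, which is the kind of failure the benchmark is designed to expose.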
