CL AI | Jun 25, 2025

SACL: Understanding and Combating Textual Bias in Code Retrieval with Semantic-Augmented Reranking and Localization

Dhruv Gupta, Gayathri Ganesh Lakshmy, Yiqing Xie

arXiv:2506.20081v26.72 citations | h-index: 1 | EMNLP

Originality: Incremental advance

AI Analysis

This work addresses textual bias in code retrieval, which matters to developers who rely on retrieval-augmented code generation. It is an incremental advance because it builds on existing retrieval methods.

The paper tackles textual bias in code retrieval: current retrievers rely on surface-level features and favor well-documented code. It proposes SACL, which reduces this bias by augmenting code or structural knowledge with semantic information. Reported gains include 12.8% Recall@1 on HumanEval for retrieval and 4.88% Pass@1 on HumanEval for code generation.
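For readers unfamiliar with the retrieval metric quoted above, the following is a minimal sketch of Recall@k under a common definition: the fraction of queries for which at least one gold document appears among the top k retrieved results. The function name `recall_at_k`, the dictionary layout, and the toy data are assumptions made for illustration; the paper's exact evaluation protocol is not described on this page.

```python
from typing import Dict, List, Set


def recall_at_k(
    ranked: Dict[str, List[str]],
    relevant: Dict[str, Set[str]],
    k: int = 1,
) -> float:
    """Fraction of queries with at least one relevant document in the top k."""
    if not ranked:
        return 0.0
    hits = 0
    for query_id, doc_ids in ranked.items():
        gold = relevant.get(query_id, set())
        # A query counts as a hit if any of its top-k results is a gold document.
        if any(doc_id in gold for doc_id in doc_ids[:k]):
            hits += 1
    return hits / len(ranked)


# Toy data (hypothetical): three queries, each with one gold document.
ranked = {
    "q1": ["d3", "d1"],
    "q2": ["d2", "d5"],
    "q3": ["d9", "d4"],
}
relevant = {"q1": {"d1"}, "q2": {"d2"}, "q3": {"d4"}}

print(recall_at_k(ranked, relevant, k=1))
print(recall_at_k(ranked, relevant, k=2))
```

With k = 1 only the top-ranked result is checked, so a retriever that ranks a well-documented but incorrect snippet first is penalized even if the correct snippet is ranked second.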

Retrieval-Augmented Code Generation (RACG) is a critical technique for enhancing code generation by retrieving relevant information. In this work, we conduct an in-depth analysis of code retrieval by systematically masking specific features while preserving code functionality. Our discoveries include: (1) although trained on code, current retrievers heavily rely on surface-level textual features (e.g., docstrings, identifier names), and (2) they exhibit a strong bias towards well-documented code, even if the documentation is irrelevant. Based on our discoveries, we propose SACL, a framework that enriches textual information and reduces bias by augmenting code or structural knowledge with semantic information. Extensive experiments show that SACL substantially improves code retrieval (e.g., by 12.8% / 9.4% / 7.0% Recall@1 on HumanEval / MBPP / SWE-Bench-Lite), which also leads to better code generation performance (e.g., by 4.88% Pass@1 on HumanEval).
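The abstract describes masking surface-level features such as docstrings and identifier names while preserving code functionality. One way such a transformation could look is sketched below using Python's standard `ast` module (Python 3.9+ for `ast.unparse`). This is not the authors' procedure: the function `mask_surface_features`, the placeholder naming scheme, and the example snippet are all assumptions for illustration. The sketch only handles function docstrings and names bound inside the snippet; it ignores keyword-argument call sites, `global`/`nonlocal` statements, and class or module docstrings.

```python
import ast


def mask_surface_features(source: str) -> str:
    """Strip function docstrings and replace locally bound names with placeholders."""
    tree = ast.parse(source)

    # Pass 1: collect every name the snippet itself binds
    # (function names, parameters, assigned variables). Builtins are left alone.
    mapping = {}
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            mapping.setdefault(node.name, f"v{len(mapping)}")
        elif isinstance(node, ast.arg):
            mapping.setdefault(node.arg, f"v{len(mapping)}")
        elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            mapping.setdefault(node.id, f"v{len(mapping)}")

    # Pass 2: drop docstrings and apply the renaming consistently.
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            body = node.body
            if (
                body
                and isinstance(body[0], ast.Expr)
                and isinstance(body[0].value, ast.Constant)
                and isinstance(body[0].value.value, str)
            ):
                # Keep the body non-empty so the function stays valid.
                node.body = body[1:] or [ast.Pass()]
            node.name = mapping[node.name]
        elif isinstance(node, ast.arg):
            node.arg = mapping[node.arg]
        elif isinstance(node, ast.Name) and node.id in mapping:
            node.id = mapping[node.id]

    # Comments are not part of the AST, so unparsing removes them as well.
    return ast.unparse(tree)


original = '''
def count_vowels(text):
    """Return the number of vowels in text."""
    total = 0
    for ch in text:
        if ch.lower() in "aeiou":
            total += 1
    return total
'''

masked = mask_surface_features(original)
print(masked)
```

The masked snippet computes the same result as the original but carries no descriptive names or documentation, so a retriever can match it only through its logic. Comparing retrieval scores before and after such masking is one way to probe how much a retriever depends on textual cues.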
