SE · AI · May 6

Beyond Retrieval: A Multitask Benchmark and Model for Code Search

Siqiao Xue, Zihan Liao, Jin Qin, Ziyin Zhang, Yixiang Mu, Fan Zhou, Hang Yu

arXiv:2605.04615 · 75.11 citations · h-index: 14

Predicted impact: top 20% in SE · last 90 days · Originality: Incremental advance

AI Analysis

For researchers and practitioners in code search, this work provides a more realistic benchmark and a reranker that improves over off-the-shelf models, though the gains are incremental.

The paper introduces CoREB, a contamination-limited multitask benchmark for code search covering retrieval and reranking, and a fine-tuned reranker (CoREB-Reranker) that achieves consistent gains across text-to-code, code-to-text, and code-to-code tasks, while revealing that short keyword queries collapse all models to near-zero nDCG@10.

Code search has usually been evaluated as first-stage retrieval, even though production systems rely on broader pipelines with reranking and developer-style queries. Existing benchmarks also suffer from data contamination, label noise, and degenerate binary relevance. In this paper, we introduce CoREB, a contamination-limited, multitask code retrieval and reranking benchmark, together with a fine-tuned code reranker, that goes beyond retrieval to cover the full code search pipeline. CoREB is built from counterfactually rewritten LiveCodeBench problems in five programming languages and delivered as timed releases with graded relevance judgments. We benchmark eleven embedding models and five rerankers across three tasks: text-to-code, code-to-text, and code-to-code. Our experiments reveal that: (1) code-specialised embeddings dominate code-to-code retrieval (roughly 2× over general encoders), yet no single model wins all three tasks; (2) short keyword queries, the format closest to real developer search, collapse every model to near-zero nDCG@10; (3) off-the-shelf rerankers are task-asymmetric, with a 12-point swing on code-to-code and no baseline net-positive across all tasks; (4) our fine-tuned CoREB-Reranker is the first to achieve consistent gains across all three tasks. The data and model are released.
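For readers unfamiliar with the headline metric, the following is a minimal sketch of nDCG@10 over graded relevance labels. It is not taken from the paper: the exponential gain 2^rel - 1 and the log2 rank discount are a common convention assumed here, and the function names `dcg_at_k` and `ndcg_at_k` are hypothetical. The paper may use a different gain or tie-handling.

```python
import math
from typing import Optional, Sequence


def dcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """Discounted cumulative gain over the first k graded labels.

    Assumption: exponential gain (2^rel - 1) with a log2(rank + 1) discount,
    where rank starts at 1. This is one common convention, not necessarily
    the one used by the benchmark.
    """
    return sum(
        (2 ** rel - 1) / math.log2(position + 2)
        for position, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(
    ranked: Sequence[float],
    judged: Optional[Sequence[float]] = None,
    k: int = 10,
) -> float:
    """nDCG@k for one query.

    ranked: graded labels of the returned items, in ranked order.
    judged: graded labels of all judged items for the query; defaults to
            `ranked` when every judged item was returned.
    """
    pool = ranked if judged is None else judged
    ideal_dcg = dcg_at_k(sorted(pool, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # no relevant item exists, so the score is defined as 0
    return dcg_at_k(ranked, k) / ideal_dcg


if __name__ == "__main__":
    # Hypothetical labels: the most relevant item is ranked second.
    print(round(ndcg_at_k([0, 2]), 4))
```

Under this convention, a ranking that places only irrelevant items in the top 10 scores 0, which is the sense in which a model can be said to collapse to near-zero nDCG@10.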
