SE · AI · May 6

Beyond Retrieval: A Multitask Benchmark and Model for Code Search

arXiv:2605.0461575.11 citations · h-index: 14
Predicted impact top 20% in SE · last 90 days · Originality: Incremental advance
AI Analysis

For researchers and practitioners in code search, this work provides a more realistic benchmark and a reranker that improves over off-the-shelf models, though the gains are incremental.

The paper introduces CoREB, a contamination-limited multitask benchmark for code search covering retrieval and reranking, and a fine-tuned reranker (CoREB-Reranker) that achieves consistent gains across text-to-code, code-to-text, and code-to-code tasks, while revealing that short keyword queries collapse all models to near-zero nDCG@10.

Code search has usually been evaluated as first-stage retrieval, even though production systems rely on broader pipelines with reranking and developer-style queries. Existing benchmarks also suffer from data contamination, label noise, and degenerate binary relevance. In this paper, we introduce CoREB, a contamination-limited, multitask code retrieval and reranking benchmark, together with a fine-tuned code reranker, that goes beyond retrieval to cover the full code search pipeline. CoREB is built from counterfactually rewritten LiveCodeBench problems in five programming languages and delivered as timed releases with graded relevance judgments. We benchmark eleven embedding models and five rerankers across three tasks: text-to-code, code-to-text, and code-to-code. Our experiments reveal that: (1) code-specialised embeddings dominate code-to-code retrieval (~2× over general encoders), yet no single model wins all three tasks; (2) short keyword queries, the format closest to real developer search, collapse every model to near-zero nDCG@10; (3) off-the-shelf rerankers are task-asymmetric, with a 12-point swing on code-to-code and no baseline net-positive across all tasks; (4) our fine-tuned CoREB-Reranker is the first to achieve consistent gains across all three tasks. The data and model are released.
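The abstract reports results as nDCG@10 over graded relevance judgments. As a rough illustration of that metric (not the paper's evaluation code), the sketch below computes nDCG@k for a single query. It assumes linear gain, a log2 rank discount, and an input list that holds the judged grade of every candidate in ranked order; the function names and example grades are hypothetical.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the top-k graded relevances.

    Assumes linear gain (the grade itself) and a log2 discount;
    some evaluations use an exponential gain (2**grade - 1) instead.
    """
    return sum(
        rel / math.log2(rank + 2)  # rank is 0-based, so the top result is divided by log2(2) = 1
        for rank, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(ranked_relevances: Sequence[float], k: int = 10) -> float:
    """nDCG@k for one query.

    `ranked_relevances` holds the graded judgment of each candidate in the
    order the model ranked them. The ideal ordering is taken from the same
    list, so it is assumed to cover all judged candidates for the query.
    """
    ideal = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    if ideal == 0:
        return 0.0  # no relevant candidates: define the score as zero
    return dcg_at_k(ranked_relevances, k) / ideal


# Hypothetical grades: a perfect ranking versus one that buries the best hit.
perfect = ndcg_at_k([3, 2, 1, 0])
buried = ndcg_at_k([0, 3])
```

With graded judgments, a ranking that places a partially relevant snippet above the best match is penalised but still earns some credit, which binary relevance cannot express.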

