Collapse of Dense Retrievers: Short, Early, and Literal Biases Outranking Factual Evidence
This work exposes robustness failures in dense retrievers used in IR and RAG systems. Because retrieval is an early step in these pipelines, the failures propagate downstream, which makes the findings relevant to researchers and practitioners in information retrieval and AI applications.
The study finds that dense retrieval models such as Dragon+ and Contriever are strongly biased toward shorter documents, early positions, repeated entities, and literal matches, often ignoring whether a document contains the factual evidence. When multiple biases combine, the models select the answer-containing document in less than 10% of cases. In downstream RAG applications, the retrieval-preferred documents cause a 34% performance drop compared to providing no documents at all.
Dense retrieval models are commonly used in Information Retrieval (IR) applications, such as Retrieval-Augmented Generation (RAG). Since they often serve as the first step in these systems, their robustness is critical to avoid downstream failures. In this work, we repurpose a relation extraction dataset (e.g., Re-DocRED) to design controlled experiments that quantify the impact of heuristic biases, such as a preference for shorter documents, on retrievers like Dragon+ and Contriever. We uncover major vulnerabilities, showing that retrievers favor shorter documents, early positions, repeated entities, and literal matches, all while ignoring the answer's presence. Notably, when multiple biases combine, models exhibit catastrophic performance degradation, selecting the answer-containing document over a synthetic biased document without the answer in less than 10% of cases. Furthermore, we show that these biases have direct consequences for downstream applications like RAG, where retrieval-preferred documents can mislead LLMs, resulting in a 34% performance drop compared to providing no documents at all.
Dataset: https://huggingface.co/datasets/mohsenfayyaz/ColDeR
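
The pairwise comparison described in the abstract (does the retriever score the answer-containing document above a biased document without the answer?) can be illustrated with a minimal sketch. Everything named here is an assumption for illustration: `toy_score` is a hypothetical stand-in for a dense retriever, deliberately built with a literal-match and brevity bias so the metric has something to detect, and `evidence_preference_rate` and the two example triples are not taken from the paper or its dataset. A real evaluation would replace `toy_score` with query-document similarity from a model such as Dragon+ or Contriever.

```python
import re
from collections import Counter
from typing import Callable, Iterable, Tuple

Triple = Tuple[str, str, str]  # (query, evidence_doc, foil_doc)


def tokenize(text: str) -> list:
    """Lowercase and split on non-word characters."""
    return re.findall(r"\w+", text.lower())


def toy_score(query: str, doc: str) -> float:
    """Hypothetical scorer, NOT a dense retriever.

    Counts literal query-term overlap and divides by document length,
    so it favors short documents with exact matches by construction.
    """
    q, d = Counter(tokenize(query)), Counter(tokenize(doc))
    overlap = sum(min(q[t], d[t]) for t in q)
    return overlap / max(sum(d.values()), 1)


def evidence_preference_rate(
    score_fn: Callable[[str, str], float], triples: Iterable[Triple]
) -> float:
    """Fraction of triples where the answer-containing document
    scores strictly higher than the foil without the answer."""
    triples = list(triples)
    if not triples:
        raise ValueError("need at least one (query, evidence, foil) triple")
    wins = sum(score_fn(q, ev) > score_fn(q, foil) for q, ev, foil in triples)
    return wins / len(triples)


# Illustrative triples (made up for this sketch).
triples = [
    (
        "Where was Marie Curie born?",
        # Contains the answer, but is longer.
        "Marie Curie was a physicist who studied radioactivity for decades. "
        "She was born in Warsaw.",
        # Short, repeats the entity, lacks the answer.
        "Marie Curie was a physicist.",
    ),
    (
        "Who wrote Hamlet?",
        "Shakespeare wrote Hamlet.",
        "Hamlet is a tragedy set in Denmark that has been staged many times.",
    ),
]

rate = evidence_preference_rate(toy_score, triples)
print(f"evidence preferred in {rate:.0%} of pairs")
```

In the first triple the short foil wins despite lacking the answer, which is the kind of failure the abstract reports; a rate near 1.0 would indicate that the scorer reliably prefers the evidence.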