IR CL Jun 26, 2022

Are We There Yet? A Decision Framework for Replacing Term Based Retrieval with Dense Retrieval Systems

Sebastian Hofstätter, Nick Craswell, Bhaskar Mitra, Hamed Zamani, Allan Hanbury

Microsoft

arXiv:2206.12993v16.55 citations, h-index: 45

Originality: Synthesis-oriented

AI Analysis

This work provides a practical framework for search system developers to assess the readiness of dense retrieval for deployment, addressing trade-offs in performance and costs, though it is incremental as it builds on existing retrieval evaluation methods.

The authors tackled the problem of determining when dense retrieval (DR) systems are ready to replace term-based retrieval by proposing a decision framework that evaluates effectiveness, costs, and guardrail criteria. They demonstrated the framework on a Web ranking scenario, finding that state-of-the-art DR models show strong average performance and robustness in guardrail tests.
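As a rough illustration of how the effectiveness and guardrail parts of such a decision might be checked, the following sketch compares per-query scores of a baseline and a candidate system. The function name `readiness_report`, the slice names, the scores, and the `max_regression_rate` threshold are all hypothetical and are not taken from the paper; cost criteria such as latency or storage are left out.

```python
def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def readiness_report(baseline, candidate, slices, max_regression_rate=0.3):
    """Compare two systems on per-query effectiveness scores.

    baseline, candidate: dict mapping query id -> score (e.g. nDCG@10).
    slices: dict mapping a slice name -> list of query ids in that slice.
    All thresholds are illustrative assumptions, not values from the paper.
    """
    if set(baseline) != set(candidate):
        raise ValueError("both systems must be scored on the same queries")
    query_ids = sorted(baseline)

    # Average effectiveness difference over all queries.
    mean_delta = mean([candidate[q] - baseline[q] for q in query_ids])

    # Guardrail 1: fraction of queries where the candidate is worse.
    regressions = [q for q in query_ids if candidate[q] < baseline[q]]
    regression_rate = len(regressions) / len(query_ids)

    # Guardrail 2: the candidate must not lose on any query slice.
    failed_slices = [
        name
        for name, members in slices.items()
        if mean([candidate[q] for q in members])
        < mean([baseline[q] for q in members])
    ]

    ready = (
        mean_delta > 0
        and regression_rate <= max_regression_rate
        and not failed_slices
    )
    return {
        "mean_delta": mean_delta,
        "regression_rate": regression_rate,
        "failed_slices": failed_slices,
        "ready": ready,
    }


# Made-up scores for four queries, grouped into two hypothetical slices.
baseline_scores = {"q1": 0.50, "q2": 0.40, "q3": 0.60, "q4": 0.30}
candidate_scores = {"q1": 0.60, "q2": 0.55, "q3": 0.50, "q4": 0.45}
query_slices = {"short": ["q1", "q2"], "long": ["q3", "q4"]}

report = readiness_report(baseline_scores, candidate_scores, query_slices)
```

In this toy data the candidate regresses on one query out of four but wins on average and on both slices, so it passes under the assumed threshold. A real assessment would add further guardrails and weigh the result against cost factors.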

Recently, several dense retrieval (DR) models have demonstrated performance competitive with the term-based retrieval methods that are ubiquitous in search systems. In contrast to term-based matching, DR projects queries and documents into a dense vector space and retrieves results via (approximate) nearest neighbor search. Deploying a new system, such as DR, inevitably involves tradeoffs in aspects of its performance. Established retrieval systems running at scale are usually well understood in terms of effectiveness and costs, such as query latency, indexing throughput, or storage requirements. In this work, we propose a framework with a set of criteria that go beyond simple effectiveness measures to thoroughly compare two retrieval systems, with the explicit goal of assessing the readiness of one system to replace the other. This includes careful tradeoff considerations between effectiveness and various cost factors. Furthermore, we describe guardrail criteria, since even a system that is better on average may have systematic failures on a minority of queries. The guardrails check for failures on certain query characteristics and for novel failure types that are only possible in dense retrieval systems. We demonstrate our decision framework on a Web ranking scenario. In that scenario, state-of-the-art DR models have surprisingly strong results: they not only perform well on average but also pass an extensive set of guardrail tests, showing robustness on different query characteristics, lexical matching, generalization, and number of regressions. It is impossible to predict whether DR will become ubiquitous in the future, but one way this could happen is through repeated applications of decision processes such as the one presented here.
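The retrieval step described in the abstract, scoring documents by their proximity to the query in a dense vector space, can be sketched as follows. This is a minimal illustration under stated assumptions: the vectors are made up rather than produced by a trained encoder, the helper names are hypothetical, and the search is exact brute force rather than the approximate nearest neighbor search a deployed system would typically use.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def dense_retrieve(query_vec, doc_vecs, k=2):
    """Return the ids of the k documents most similar to the query.

    doc_vecs: dict mapping document id -> embedding (list of floats).
    Exact search over every document; an ANN index would replace this loop.
    """
    q = normalize(query_vec)
    scored = []
    for doc_id, vec in doc_vecs.items():
        d = normalize(vec)
        score = sum(a * b for a, b in zip(q, d))
        scored.append((score, doc_id))
    # Highest similarity first; ties are broken by document id.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:k]]


# Toy two-dimensional embeddings (assumed values, not from any model).
DOC_VECS = {"d1": [1.0, 0.0], "d2": [0.0, 1.0], "d3": [0.7, 0.7]}
top_docs = dense_retrieve([0.9, 0.1], DOC_VECS, k=2)
```

Here the query vector points mostly along the first axis, so `d1` ranks first and `d3` second. Term-based retrieval would instead score documents by overlapping terms, which is why the abstract treats lexical matching as a separate guardrail for DR.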
