IR CL Dec 2, 2022

Moving Beyond Downstream Task Accuracy for Information Retrieval Benchmarking

IBM
arXiv:2212.01340v1226 citations h-index: 76
Originality: Synthesis-oriented
AI Analysis

This paper addresses a gap in information retrieval benchmarking that matters most to researchers and practitioners who need deployable systems, though the contribution is incremental because it builds on existing benchmarks.

The paper tackles the problem that current neural information retrieval benchmarks focus only on accuracy and ignore efficiency costs such as latency and hardware cost, which are critical for real-world deployment. It demonstrates on MS MARCO and XOR-TyDi that the best choice of IR system changes once efficiency metrics are included, and it advocates more holistic evaluation.

Neural information retrieval (IR) systems have progressed rapidly in recent years, in large part due to the release of publicly available benchmarking tasks. Unfortunately, some dimensions of this progress are illusory: the majority of the popular IR benchmarks today focus exclusively on downstream task accuracy and thus conceal the costs incurred by systems that trade away efficiency for quality. Latency, hardware cost, and other efficiency considerations are paramount to the deployment of IR systems in user-facing settings. We propose that IR benchmarks structure their evaluation methodology to include not only metrics of accuracy, but also efficiency considerations such as query latency and the corresponding cost budget for a reproducible hardware setting. For the popular IR benchmarks MS MARCO and XOR-TyDi, we show how the best choice of IR system varies according to how these efficiency considerations are chosen and weighed. We hope that future benchmarks will adopt these guidelines toward more holistic IR evaluation.
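
The idea that the best system depends on the efficiency budget can be illustrated with a minimal sketch. It assumes each system is summarized by one accuracy score, one query latency, and one cost figure measured on a fixed hardware setting. The names `SystemResult` and `best_under_budget`, the system labels, and all numbers are hypothetical placeholders chosen for illustration; they are not results or an evaluation protocol taken from the paper.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class SystemResult:
    """One benchmark entry (hypothetical schema, not from the paper)."""
    name: str
    accuracy: float          # task accuracy, higher is better
    latency_ms: float        # query latency on a fixed hardware setting
    cost_per_million: float  # cost of serving one million queries


def best_under_budget(
    results: Sequence[SystemResult],
    max_latency_ms: float,
    max_cost: float,
) -> Optional[SystemResult]:
    """Return the most accurate system that fits both budgets, or None."""
    eligible = [
        r for r in results
        if r.latency_ms <= max_latency_ms and r.cost_per_million <= max_cost
    ]
    return max(eligible, key=lambda r: r.accuracy, default=None)


# Illustrative, made-up numbers; NOT measurements from MS MARCO or XOR-TyDi.
RESULTS = [
    SystemResult("sparse-baseline", accuracy=0.19, latency_ms=10.0, cost_per_million=1.0),
    SystemResult("dense-retriever", accuracy=0.33, latency_ms=50.0, cost_per_million=8.0),
    SystemResult("late-interaction", accuracy=0.40, latency_ms=200.0, cost_per_million=30.0),
]

if __name__ == "__main__":
    # A loose budget and a tight budget can pick different systems.
    for latency, cost in [(1000.0, 100.0), (60.0, 10.0), (5.0, 1.0)]:
        winner = best_under_budget(RESULTS, latency, cost)
        label = winner.name if winner else "no system fits"
        print(f"latency <= {latency} ms, cost <= {cost}: {label}")
```

With these placeholder numbers, an accuracy-only ranking always favors the slowest system, while a tighter latency or cost budget changes the winner or leaves no eligible system at all. A hard budget is only one way to weigh efficiency; a weighted score or a Pareto frontier would be reasonable alternatives.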

