CLAIIR
May 3, 2024

Semi-Parametric Retrieval via Binary Bag-of-Tokens Index

arXiv:2405.01924v2 | h-index: 18 | ICLR
Originality: Highly original
AI Analysis

This paper addresses indexing efficiency and cost in information retrieval applications. Its approach is a hybrid that combines neural and term-based methods rather than an incremental refinement of either.

The paper tackles efficient, cost-effective indexing for neural retrieval systems. It introduces SiDR, a bi-encoder framework that decouples the retrieval index from neural parameters and achieves BM25-like indexing complexity with better effectiveness across 16 benchmarks. With a tokenization-based index, it outperforms BM25 on all in-domain datasets.

Information retrieval has transitioned from standalone systems into essential components across broader applications, with indexing efficiency, cost-effectiveness, and freshness becoming increasingly critical yet often overlooked. In this paper, we introduce SemI-parametric Disentangled Retrieval (SiDR), a bi-encoder retrieval framework that decouples retrieval index from neural parameters to enable efficient, low-cost, and parameter-agnostic indexing for emerging use cases. Specifically, in addition to using embeddings as indexes like existing neural retrieval methods, SiDR supports a non-parametric tokenization index for search, achieving BM25-like indexing complexity with significantly better effectiveness. Our comprehensive evaluation across 16 retrieval benchmarks demonstrates that SiDR outperforms both neural and term-based retrieval baselines under the same indexing workload: (i) When using an embedding-based index, SiDR exceeds the performance of conventional neural retrievers while maintaining similar training complexity; (ii) When using a tokenization-based index, SiDR drastically reduces indexing cost and time, matching the complexity of traditional term-based retrieval, while consistently outperforming BM25 on all in-domain datasets; (iii) Additionally, we introduce a late parametric mechanism that matches BM25 index preparation time while outperforming other neural retrieval baselines in effectiveness.
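To make the idea of a non-parametric tokenization index concrete, here is a minimal sketch of one plausible reading of the abstract. Documents are indexed purely by which tokens they contain (a binary bag of tokens), so building the index needs no neural model, and a query is scored by summing per-token query weights over the tokens a document contains. The whitespace tokenizer, the function names, and the hand-written `query_weights` (standing in for the output of a neural query encoder) are all assumptions for illustration. The abstract does not specify SiDR's actual scoring function or tokenizer.

```python
from collections import defaultdict


def tokenize(text):
    """Toy whitespace tokenizer (an assumption; a real system would
    likely reuse the encoder's own tokenizer)."""
    return text.lower().split()


def build_binary_index(docs):
    """Build an inverted index mapping token -> set of document ids.

    Each document contributes only token presence (binary), so indexing
    cost is comparable to a term-based index and involves no model.
    """
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for tok in set(tokenize(text)):
            index[tok].add(doc_id)
    return index


def search(index, query_weights, top_k=3):
    """Score each document as the sum of query weights over the tokens it
    contains, i.e. a dot product between a weighted query vector and the
    document's binary token vector."""
    scores = defaultdict(float)
    for tok, weight in query_weights.items():
        for doc_id in index.get(tok, ()):
            scores[doc_id] += weight
    # Sort by descending score, breaking ties by document id.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


docs = [
    "neural retrieval with dense embeddings",
    "bm25 is a term based retrieval method",
    "binary token index for fast indexing",
]
index = build_binary_index(docs)

# Hypothetical weights standing in for a neural query encoder's output
# over the vocabulary; only non-zero entries are listed.
query_weights = {"index": 1.5, "indexing": 1.2, "retrieval": 0.4, "fast": 0.8}
results = search(index, query_weights)
```

In this sketch only the query side would depend on model parameters, which is one way to read the claim that the index is decoupled from neural parameters: swapping the encoder would not require rebuilding `index`.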

Code Implementations: 1 repo
