IRCLJul 4, 2024

BM25S: Orders of magnitude faster lexical search via eager sparse scoring

MILA
arXiv:2407.03618v1111 citationsh-index: 13Has Code
Originality Synthesis-oriented
AI Analysis

This work provides a faster, open-source alternative for lexical search in Python, benefiting developers and researchers in information retrieval, but it is incremental as it optimizes an existing method rather than introducing a new paradigm.

The paper tackles the problem of slow lexical search in Python by introducing BM25S, an efficient implementation that achieves up to 500x speedup compared to popular frameworks through eager sparse scoring and a novel score shifting method for non-sparse variants.

We introduce BM25S, an efficient Python-based implementation of BM25 that only depends on Numpy and Scipy. BM25S achieves up to a 500x speedup compared to the most popular Python-based framework by eagerly computing BM25 scores during indexing and storing them into sparse matrices. It also achieves considerable speedups compared to highly optimized Java-based implementations, which are used by popular commercial products. Finally, BM25S reproduces the exact implementation of five BM25 variants based on Kamphuis et al. (2020) by extending eager scoring to non-sparse variants using a novel score shifting method. The code can be found at https://github.com/xhluca/bm25s

Code Implementations3 repos
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes