Lighting the Way for BRIGHT: Reproducible Baselines with Anserini, Pyserini, and RankLLM
For researchers building retrieval-augmented generation systems, this work provides reproducible baselines and highlights data quality issues in the BRIGHT benchmark.
The paper reproduces and extends the BRIGHT retrieval benchmark for LLMs and finds that query-side BM25 (BM25Q) outperforms standard BM25 on long queries, although its gains are largely specific to BRIGHT; fusion with standard BM25 provides more consistent improvements across datasets.
Retrieval benchmarks for large language models (LLMs) should reflect the long, reasoning-intensive queries typical of retrieval-augmented generation (RAG). We present a systematic study of BRIGHT, a reasoning-focused retrieval benchmark, along with strong, reproducible reference methods integrated into Anserini, Pyserini, and RankLLM. We evaluate lexical, sparse, dense, and fusion-based retrievers, as well as LLM rerankers, under long-query settings. In reproducing BRIGHT's lexical baseline, we identify a key under-documented detail: query-side BM25 (BM25Q), which applies BM25 weighting to the query itself. On long, multi-sentence queries, BM25Q consistently outperforms standard BM25, making it the strongest lexical baseline for reasoning-oriented retrieval. We further audit the BRIGHT corpus, uncovering data quality issues that impact evaluation, and offer mitigations. Finally, we study the generalizability of BM25Q across five additional benchmarks, finding its gains largely specific to BRIGHT, while fusion with standard BM25 provides the most consistent improvements across datasets.
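
To make the idea of "BM25 weighting applied to the query itself" concrete, the following is a minimal sketch, not the Anserini or Pyserini implementation. It assumes one plausible reading of BM25Q: the query is treated like a short document, so each query term's weight is its saturated, length-normalized term frequency rather than its raw count. The function names (`saturate`, `idf`, `score`), the toy corpus, and the parameters k1 = 0.9 and b = 0.4 are illustrative assumptions and are not taken from the paper.

```python
import math
from collections import Counter

K1, B = 0.9, 0.4  # assumed parameters for illustration only


def saturate(tf, length, avg_len, k1=K1, b=B):
    """BM25 term-frequency saturation with length normalization."""
    return tf * (k1 + 1) / (tf + k1 * (1 - b + b * length / avg_len))


def idf(term, doc_freq, n_docs):
    """Smoothed inverse document frequency (always non-negative)."""
    df = doc_freq.get(term, 0)
    return math.log(1 + (n_docs - df + 0.5) / (df + 0.5))


def score(query, doc, doc_freq, n_docs, avg_len, query_side=False):
    """Score one tokenized document against one tokenized query.

    query_side=False: each query term is weighted by its raw count.
    query_side=True:  each query term is weighted by BM25 saturation
                      of its count in the query (the assumed BM25Q reading).
    """
    q_tf, d_tf = Counter(query), Counter(doc)
    total = 0.0
    for term, qtf in q_tf.items():
        if term not in d_tf:
            continue
        q_weight = saturate(qtf, len(query), avg_len) if query_side else float(qtf)
        d_weight = saturate(d_tf[term], len(doc), avg_len)
        total += q_weight * idf(term, doc_freq, n_docs) * d_weight
    return total


# Hypothetical toy corpus and a long query that repeats one term.
docs = [
    "bm25 ranks documents by term overlap".split(),
    "dense retrievers embed queries and documents".split(),
    "rerankers reorder a candidate list".split(),
]
query = "why does bm25 work well bm25 uses term statistics bm25".split()

doc_freq = Counter(t for d in docs for t in set(d))
n_docs = len(docs)
avg_len = sum(len(d) for d in docs) / n_docs

standard = score(query, docs[0], doc_freq, n_docs, avg_len)
bm25q = score(query, docs[0], doc_freq, n_docs, avg_len, query_side=True)
```

Under these assumptions, a term repeated many times in a long query contributes less than linearly to the BM25Q score, which damps the influence of frequently repeated words in multi-sentence queries. Whether that matches the exact weighting used in the paper's implementation should be checked against the released Anserini and Pyserini code.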