CV · AI · Mar 6

Visual Words Meet BM25: Sparse Auto-Encoder Visual Word Scoring for Image Retrieval

arXiv:2603.05781v1 · h-index: 5
Predicted impact: top 71% in CV · last 90 days · Originality: Incremental advance
AI Analysis

This work provides an efficient and interpretable first-stage image retrieval method for large-scale applications, particularly benefiting systems that require high recall with reduced computational cost for subsequent dense reranking.

This paper introduces BM25-V, a method that applies Okapi BM25 scoring to sparse visual-word activations derived from a Sparse Auto-Encoder (SAE) on Vision Transformer patch features for image retrieval. It achieves Recall@200 of at least 0.993 across seven benchmarks, enabling a two-stage retrieval pipeline that recovers near-dense accuracy within 0.2% on average by reranking only 200 candidates.
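The two-stage idea can be illustrated with a minimal sketch: keep only the K best candidates from a cheap first-stage score, then reorder just those candidates by dense similarity. The function names (`cosine`, `two_stage_search`), the dictionary-based data layout, and the use of cosine similarity for the dense stage are assumptions made for illustration, not details taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def two_stage_search(query_vec, sparse_scores, dense_vecs, k=200, top_n=10):
    """Rerank the k best first-stage candidates by dense similarity.

    sparse_scores: dict image_id -> first-stage score (hypothetical layout)
    dense_vecs:    dict image_id -> dense embedding (hypothetical layout)
    """
    # Stage 1: keep the k highest-scoring candidates (ties broken by id).
    candidates = sorted(sparse_scores, key=lambda d: (-sparse_scores[d], d))[:k]
    # Stage 2: dense similarity is computed for those k candidates only.
    reranked = sorted(candidates, key=lambda d: (-cosine(query_vec, dense_vecs[d]), d))
    return reranked[:top_n]
```

In this sketch the dense cost per query depends on K rather than on the gallery size, which is why the recall of the first stage at K candidates determines how much accuracy the pipeline can recover.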

Dense image retrieval is accurate but offers limited interpretability and attribution, and it can be compute-intensive at scale. We present BM25-V, which applies Okapi BM25 scoring to sparse visual-word activations from a Sparse Auto-Encoder (SAE) on Vision Transformer patch features. Across a large gallery, visual-word document frequencies are highly imbalanced and follow a Zipfian-like distribution, making BM25's inverse document frequency (IDF) weighting well suited for suppressing ubiquitous, low-information words and emphasizing rare, discriminative ones. BM25-V retrieves high-recall candidates via sparse inverted-index operations and serves as an efficient first-stage retriever for dense reranking. Across seven benchmarks, BM25-V achieves Recall@200 ≥ 0.993, enabling a two-stage pipeline that reranks only K = 200 candidates per query and recovers near-dense accuracy within 0.2% on average. An SAE trained once on ImageNet-1K transfers zero-shot to seven fine-grained benchmarks without fine-tuning, and BM25-V retrieval decisions are attributable to specific visual words with quantified IDF contributions.
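To make the scoring step concrete, here is a minimal sketch of Okapi BM25 over an inverted index of visual words. Several things are assumptions for illustration rather than details from the paper: each image is treated as a list of visual-word ids (one id per active SAE latent occurrence), term frequency is the count of that id, the IDF uses the common non-negative variant log(1 + (N - df + 0.5) / (df + 0.5)), and `k1=1.2`, `b=0.75` are conventional defaults. The names `build_index`, `bm25_scores`, and `top_k` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """Build an inverted index from a gallery.

    docs: list of lists of visual-word ids, one list per image (assumed layout).
    Returns (postings, doc_len) where postings maps word id -> [(doc_id, tf)].
    """
    postings = defaultdict(list)
    doc_len = []
    for doc_id, words in enumerate(docs):
        doc_len.append(len(words))
        for word, tf in Counter(words).items():
            postings[word].append((doc_id, tf))
    return postings, doc_len


def bm25_scores(query_words, postings, doc_len, k1=1.2, b=0.75):
    """Score every gallery image that shares at least one visual word with the query."""
    n_docs = len(doc_len)
    avgdl = sum(doc_len) / n_docs
    scores = defaultdict(float)
    for word in set(query_words):
        plist = postings.get(word, [])
        if not plist:
            continue
        df = len(plist)
        # Rare words get a large IDF; words present in nearly every image get ~0.
        idf = math.log(1.0 + (n_docs - df + 0.5) / (df + 0.5))
        for doc_id, tf in plist:
            norm = tf + k1 * (1.0 - b + b * doc_len[doc_id] / avgdl)
            scores[doc_id] += idf * tf * (k1 + 1.0) / norm
    return dict(scores)


def top_k(scores, k):
    """Return the k highest-scoring image ids (ties broken by id)."""
    return sorted(scores, key=lambda d: (-scores[d], d))[:k]


# Toy gallery: word 0 appears in every image, word 7 in only two of them.
gallery = [[0, 0, 1, 7], [0, 2, 2, 3], [0, 1, 4, 4], [0, 5, 6, 7]]
postings, doc_len = build_index(gallery)
scores = bm25_scores([0, 7], postings, doc_len)
candidates = top_k(scores, 2)
```

Because each query word's contribution is a separate `idf * tf`-based term, the per-word terms in the inner loop are what one would inspect to attribute a match to specific visual words.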

