IR · Jan 9

Statistical Foundations of DIME: Risk Estimation for Practical Index Selection

arXiv:2601.05649 · h-index: 9
Originality: Incremental advance
AI Analysis

For practitioners of dense retrieval, the proposed criterion removes the need for a costly grid search and enables per-query dimension selection, improving efficiency without sacrificing effectiveness.

The paper provides a statistically grounded criterion for DIME that selects optimal embedding dimensions per query at inference time, reducing embedding size by ~50% while maintaining effectiveness.

High-dimensional dense embeddings have become central to modern Information Retrieval, but many dimensions are noisy or redundant. The recently proposed DIME (Dimension IMportance Estimation) provides query-dependent scores to identify the informative components of embeddings. However, DIME relies on a costly grid search to select, a priori, a single dimensionality for the embeddings of the entire query corpus. Our work provides a statistically grounded criterion that directly identifies the optimal set of dimensions for each query at inference time. Experiments confirm that the criterion achieves parity of effectiveness while reducing embedding size by an average of $\sim50\%$ across different models and datasets at inference time.
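The contrast between one fixed dimensionality for all queries and a per-query selection can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the abstract does not state the scoring function or the selection criterion. The score used here (query component times the mean of that component over a few feedback documents) and the keep-if-positive rule are hypothetical stand-ins, not the paper's risk-based criterion, and all function names are invented.

```python
from typing import List, Sequence


def dime_scores(query: Sequence[float],
                feedback_docs: Sequence[Sequence[float]]) -> List[float]:
    """Hypothetical per-dimension importance: query component times the
    mean of that component over the feedback documents (an assumption)."""
    n = len(feedback_docs)
    centroid = [sum(doc[i] for doc in feedback_docs) / n
                for i in range(len(query))]
    return [q * c for q, c in zip(query, centroid)]


def select_fixed_fraction(scores: Sequence[float], fraction: float) -> List[int]:
    """Baseline: keep the same fraction of top-scored dimensions for every
    query. The fraction is what a grid search would have to tune."""
    k = max(1, round(fraction * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def select_per_query(scores: Sequence[float], threshold: float = 0.0) -> List[int]:
    """Placeholder per-query rule (assumption, not the paper's criterion):
    keep every dimension whose score exceeds the threshold."""
    return [i for i, s in enumerate(scores) if s > threshold]


def mask(vector: Sequence[float], kept: Sequence[int]) -> List[float]:
    """Zero out every dimension that was not selected."""
    kept_set = set(kept)
    return [v if i in kept_set else 0.0 for i, v in enumerate(vector)]


# Toy 4-dimensional example with made-up numbers.
query = [0.5, -0.25, 0.5, 0.25]
feedback = [[1.0, 0.5, 0.5, -0.5], [0.5, 0.5, 1.0, -0.5]]

scores = dime_scores(query, feedback)
kept = select_per_query(scores)
reduced_query = mask(query, kept)
```

In this toy case the per-query rule keeps two of the four dimensions without any tuned fraction, whereas `select_fixed_fraction` needs its `fraction` argument chosen in advance for all queries.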

