CL, LG · Jun 10, 2025

Dense Retrievers Can Fail on Simple Queries: Revealing The Granularity Dilemma of Embeddings

arXiv:2506.08592v2 · 4 citations · h-index: 39 · Has Code · EMNLP
Originality: Incremental advance
AI Analysis

This work addresses a limitation of retrieval systems that affects users who need precise information, but the advance is incremental because it builds on existing encoder methods.

The paper tackles the problem of dense retrievers failing on simple queries due to embeddings' inability to recognize fine-grained entities or events, and shows that fine-tuning with proposed data generation strategies enables a small 0.1B encoder to outperform a state-of-the-art 7B model.

This work stems from an observed limitation of text encoders: embeddings may fail to recognize fine-grained entities or events within the encoded semantics, resulting in failed retrieval even in simple cases. To examine such behaviors, we first introduce a new evaluation dataset, CapRetrieval, in which passages are image captions and queries are phrases targeting entity or event concepts in diverse forms. Zero-shot evaluation suggests that encoders often struggle with this kind of fine-grained matching, regardless of training sources or model size. To improve on this, we finetune encoders with our proposed data generation strategies, enabling a small 0.1B encoder to outperform the state-of-the-art 7B model. Within this process, we further uncover the granularity dilemma, a challenge for embeddings to capture fine-grained salience while aligning with overall semantics. Our dataset, code and models in this work are publicly released at https://github.com/lxucs/CapRetrieval.
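
The zero-shot evaluation described above amounts to embedding queries and passages, ranking passages by similarity, and checking whether a relevant passage ranks near the top. The following is a minimal sketch of that loop, not the authors' evaluation code: the vectors are hand-written toy values standing in for encoder outputs, and the function names (`cosine_scores`, `hit_rate_at_k`) and the hit-rate metric are assumptions chosen for illustration.

```python
import numpy as np

def cosine_scores(query_vecs, passage_vecs):
    """Return a (num_queries, num_passages) matrix of cosine similarities."""
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    p = passage_vecs / np.linalg.norm(passage_vecs, axis=1, keepdims=True)
    return q @ p.T

def hit_rate_at_k(scores, relevant, k):
    """Fraction of queries with at least one relevant passage in the top k."""
    top_k = np.argsort(-scores, axis=1)[:, :k]
    hits = [bool(set(row.tolist()) & rel) for row, rel in zip(top_k, relevant)]
    return float(np.mean(hits))

# Toy embeddings (assumed values, standing in for real encoder outputs).
passages = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0],
                     [0.0, 0.0, 1.0]])
queries = np.array([[0.9, 0.1, 0.0],   # closest to passage 0
                    [0.1, 0.2, 0.9]])  # closest to passage 2
# Query 1 is labeled with passage 1: a fine-grained target that the
# overall embedding similarity ranks only second.
relevant = [{0}, {1}]

scores = cosine_scores(queries, passages)
print(hit_rate_at_k(scores, relevant, k=1))  # 0.5
print(hit_rate_at_k(scores, relevant, k=2))  # 1.0
```

In this toy setup the second query misses at k=1 because its embedding is dominated by a direction that its labeled passage does not share, which loosely mirrors the failure mode the abstract describes.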

Code Implementations: 1 repo
