Beyond Hard Negatives: The Importance of Score Distribution in Knowledge Distillation for Dense Retrieval
This work addresses a bottleneck in training dense retrieval models, offering an incremental improvement by optimizing the composition of the training data used in knowledge distillation.
The paper shows that, in knowledge distillation for dense retrieval, training only on hard negatives limits how much of the teacher's preference structure the student can learn. It proposes a stratified sampling strategy that uniformly covers the teacher's score spectrum and reports significant performance improvements over top-K and random sampling on in-domain and out-of-domain benchmarks.
Transferring knowledge from a cross-encoder teacher via Knowledge Distillation (KD) has become a standard paradigm for training retrieval models. While existing studies have largely focused on mining hard negatives to improve discrimination, the systematic composition of training data and the resulting teacher score distribution have received comparatively little attention. In this work, we highlight that focusing solely on hard negatives prevents the student from learning the comprehensive preference structure of the teacher, potentially hampering generalization. To effectively emulate the teacher score distribution, we propose a Stratified Sampling strategy that uniformly covers the entire score spectrum. Experiments on in-domain and out-of-domain benchmarks confirm that Stratified Sampling, which preserves the variance and entropy of teacher scores, serves as a robust baseline, significantly outperforming top-K and random sampling in diverse settings. These findings suggest that the essence of distillation lies in preserving the diverse range of relative scores perceived by the teacher.
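The contrast between top-K selection and sampling across the score spectrum can be illustrated with a minimal sketch. The abstract does not specify how strata are formed, so the equal-width bins, the round-robin draw, and the names `top_k` and `stratified_sample` are all assumptions made for illustration, not the paper's actual procedure.

```python
import random
from typing import List, Sequence


def top_k(scores: Sequence[float], k: int) -> List[int]:
    """Baseline: indices of the k highest teacher scores (hardest candidates)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def stratified_sample(
    scores: Sequence[float], k: int, num_bins: int = 4, seed: int = 0
) -> List[int]:
    """Pick k indices spread across equal-width teacher-score bins.

    Assumption: strata are equal-width intervals over [min, max] of the
    teacher scores; the paper may define them differently.
    """
    if k > len(scores):
        raise ValueError("k cannot exceed the number of candidates")
    rng = random.Random(seed)
    lo, hi = min(scores), max(scores)
    # Fall back to a width of 1.0 when every score is identical.
    width = (hi - lo) / num_bins or 1.0

    # Assign each candidate index to its score bin.
    bins: List[List[int]] = [[] for _ in range(num_bins)]
    for idx, s in enumerate(scores):
        b = min(int((s - lo) / width), num_bins - 1)
        bins[b].append(idx)
    for b in bins:
        rng.shuffle(b)

    # Draw round-robin so each non-empty bin contributes as evenly as possible.
    picked: List[int] = []
    while len(picked) < k:
        for b in bins:
            if b and len(picked) < k:
                picked.append(b.pop())
    return picked


# Hypothetical teacher scores for 12 candidate passages of one query.
teacher_scores = [i / 11 for i in range(12)]
hard_only = top_k(teacher_scores, 4)
spread = stratified_sample(teacher_scores, 4)
```

In this toy setting `hard_only` holds only the four highest-scoring candidates, while `spread` contains one candidate from each quarter of the score range, so the selected teacher scores span a much wider interval for the student to match.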