IR · CL · May 2

The Pre-Training Study of Expanded-SPLADE Models on Web Document Titles

arXiv:2605.01407 · 7.0 · h-index: 1
AI Analysis

For researchers working on SPLADE-style neural IR models, this study provides empirical insights into the misalignment between MLM pre-training and retrieval fine-tuning, highlighting trade-offs between effectiveness and efficiency.

This paper investigates how different pre-training datasets and options affect the fine-tuning of Expanded-SPLADE (ESPLADE) models for neural IR. It finds that models pre-trained on general corpora with higher learning rates yield better retrieval effectiveness despite lower MLM accuracy, and that under the strictest pruning setting those same models incur higher retrieval cost and higher variance in postings-list length.

Masked Language Modeling (MLM) pre-training is one of the primary ways to initialize Neural Information Retrieval (IR) models before retrieval fine-tuning. However, studies show that MLM pre-trained models are not fully ready for fine-tuning into neural bi-encoder models and suffer from transfer-learning issues. This paper studies how different pre-training datasets and pre-training options affect MLM pre-trained models with respect to retrieval fine-tuning. The study focuses on SPLADE-style models, which also use the MLM layer at fine-tuning time. More specifically, we experiment with Expanded-SPLADE (ESPLADE) models, a specific instance of SPLADE models, using in-house web document titles as datasets. We conduct pre-training, fine-tuning, and evaluation with optional test-time pruning of sparse vectors. Our observations are three-fold. First, the fine-tuned models with higher retrieval effectiveness, in both the unpruned and the strictest pruning settings, are mostly those pre-trained on a general corpus with a higher learning rate, and they show lower MLM accuracy. Second, in the strictest pruning setting, those models show a higher retrieval cost and a higher variance in the lengths of individual postings lists. Third, repeating the general pre-training dataset has little effect on retrieval effectiveness. The experiments empirically identify potential limitations in aligning MLM pre-training with ESPLADE fine-tuning. They also show that, in the strictest pruning setting, retrieval effectiveness is better maintained at a higher retrieval cost, indicating a trade-off between the two in our setting.
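
To make the quantities named in the abstract concrete, here is a minimal sketch, not the authors' implementation. It shows the standard SPLADE activation that turns MLM logits into a sparse term-weight vector (log-saturated ReLU, max-pooled over tokens), a test-time pruning step, and postings-list statistics. The abstract does not define the pruning rule or the retrieval cost, so top-k pruning and a cost equal to the number of postings entries visited are assumptions, and every function name and toy value below is hypothetical.

```python
import math
from collections import defaultdict
from statistics import mean, pvariance


def splade_vector(mlm_logits):
    """Pool per-token MLM logits (tokens x vocab) into one sparse vector.

    Standard SPLADE activation: max over tokens of log(1 + ReLU(logit)).
    Returns {term_id: weight}; zero weights are dropped.
    """
    vocab_size = len(mlm_logits[0])
    vec = {}
    for term in range(vocab_size):
        w = max(math.log1p(max(row[term], 0.0)) for row in mlm_logits)
        if w > 0.0:
            vec[term] = w
    return vec


def prune_top_k(vec, k):
    """Assumed pruning rule: keep only the k highest-weighted terms."""
    top = sorted(vec.items(), key=lambda kv: kv[1], reverse=True)[:k]
    return dict(top)


def build_postings(doc_vectors):
    """Inverted index: term_id -> list of (doc_id, weight)."""
    postings = defaultdict(list)
    for doc_id, vec in enumerate(doc_vectors):
        for term, w in vec.items():
            postings[term].append((doc_id, w))
    return postings


def postings_stats(postings):
    """Mean and population variance of postings-list lengths."""
    lengths = [len(p) for p in postings.values()]
    return mean(lengths), pvariance(lengths)


def retrieval_cost(query_vec, postings):
    """Assumed cost proxy: postings entries visited for the query's terms."""
    return sum(len(postings.get(term, ())) for term in query_vec)


if __name__ == "__main__":
    # Toy logits for a two-token title over a four-term vocabulary.
    logits = [[2.0, -1.0, 0.5, 0.0], [0.0, 3.0, -2.0, 0.0]]
    doc = prune_top_k(splade_vector(logits), k=2)
    index = build_postings([doc, {0: 0.7}, {0: 0.2, 2: 0.9}])
    print(doc, postings_stats(index), retrieval_cost({0: 1.0, 2: 0.3}, index))
```

Under this proxy, a model that concentrates weight on a few frequent terms produces a few very long postings lists, which raises both the length variance and the per-query cost even when each vector is pruned to the same k.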
