44.5 | IR | Jun 4
ColBERTSaR: Sparsified ColBERT Index via Product Quantization
Eugene Yang, Andrew Yates, Dawn Lawrie et al.
While ColBERT is an effective neural retrieval architecture, it requires a heavy index structure to support candidate set retrieval based on approximated token embeddings, gathering and decompressing document token embeddings, and applying the MaxSim operation. Indexes in PLAID and similar ColBERT implementations require five to ten times the disk storage of the original raw text, which limits their scalability. Furthermore, prior work has identified that the gathering and decompression stages are the primary inefficiencies at query time. Limiting the number of document tokens that must be gathered by thresholding and score approximation does not eliminate the need for the entire index to support ad hoc queries. In this work, we propose an embedding quantization approach that turns a ColBERT index into a true inverted index. We show that, theoretically, ColBERT with embedding quantization is equivalent to learned-sparse retrieval except for the scoring mechanism. Empirically, we demonstrate that our index is 50-70% smaller than a one-bit PLAID index while retaining retrieval effectiveness.
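For readers unfamiliar with the MaxSim operation mentioned in this abstract, here is a minimal pure-Python sketch of late-interaction scoring: each query token embedding takes its best dot product against the document's token embeddings, and those maxima are summed. The function names and toy vectors are illustrative assumptions; this is not the paper's quantized index or PLAID's implementation.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs: List[Vector], doc_vecs: List[Vector]) -> float:
    """Sum, over query tokens, of the best dot product with any document token."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Toy 2-dimensional token embeddings (hypothetical values).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.5, 0.5]]
doc_b = [[0.0, 1.0], [0.0, 0.5]]

score_a = maxsim_score(query, doc_a)  # 1.0 + 0.5
score_b = maxsim_score(query, doc_b)  # 0.0 + 1.0
```

Because every query token needs every candidate document token, the cost of gathering and decompressing those vectors dominates, which is the inefficiency the abstract targets.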
SD | Sep 3, 2024 | Code
LSTMSE-Net: Long Short Term Speech Enhancement Network for Audio-visual Speech Enhancement
Arnav Jain, Jasmer Singh Sanjotra, Harshvardhan Choudhary et al.
In this paper, we propose long short term memory speech enhancement network (LSTMSE-Net), an audio-visual speech enhancement (AVSE) method. This innovative method leverages the complementary nature of visual and audio information to boost the quality of speech signals. Visual features are extracted with VisualFeatNet (VFN), and audio features are processed through an encoder and decoder. The system scales and concatenates visual and audio features, then processes them through a separator network for optimized speech enhancement. The architecture highlights advancements in leveraging multi-modal data and interpolation techniques for robust AVSE challenge systems. The performance of LSTMSE-Net surpasses that of the baseline model from the COG-MHEAR AVSE Challenge 2024 by a margin of 0.06 in scale-invariant signal-to-distortion ratio (SISDR), 0.03 in short-time objective intelligibility (STOI), and 1.32 in perceptual evaluation of speech quality (PESQ). The source code of the proposed LSTMSE-Net is available at https://github.com/mtanveer1/AVSEC-3-Challenge.
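As a reference point for the SISDR figure quoted above, the following is a small sketch of the commonly used scale-invariant signal-to-distortion ratio definition: project the estimate onto the reference, treat the remainder as distortion, and report the energy ratio in decibels. It assumes zero-mean signals of equal length and a non-perfect estimate (non-zero distortion); it is not the challenge's evaluation code, and the sample values are made up.

```python
import math
from typing import Sequence


def si_sdr(estimate: Sequence[float], reference: Sequence[float]) -> float:
    """Scale-invariant SDR in dB (assumes zero-mean inputs of equal length)."""
    inner = sum(e * r for e, r in zip(estimate, reference))
    ref_energy = sum(r * r for r in reference)
    scale = inner / ref_energy  # optimal scaling of the reference
    target = [scale * r for r in reference]
    distortion = [e - t for e, t in zip(estimate, target)]
    target_energy = sum(t * t for t in target)
    distortion_energy = sum(n * n for n in distortion)
    return 10.0 * math.log10(target_energy / distortion_energy)


# Hypothetical clean signal and a slightly noisy estimate.
reference = [1.0, 0.0, -1.0, 0.0]
estimate = [1.0, 0.1, -1.0, 0.1]

value = si_sdr(estimate, reference)
# Rescaling the estimate should leave the metric unchanged.
value_rescaled = si_sdr([2.0 * x for x in estimate], reference)
```

The rescaled call illustrates the "scale-invariant" part of the name: multiplying the estimate by a constant changes both the projected target and the distortion by the same factor.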
IR | Feb 24 | Code
Multi-Vector Index Compression in Any Modality
Hanxiang Qin, Alexander Martin, Rohan Jha et al.
We study efficient multi-vector retrieval for late interaction in any modality. Late interaction has emerged as a dominant paradigm for information retrieval in text, images, visual documents, and videos, but its computation and storage costs grow linearly with document length, making it costly for image-, video-, and audio-rich corpora. To address this limitation, we explore query-agnostic methods for compressing multi-vector document representations under a constant vector budget. We introduce four approaches for index compression: sequence resizing, memory tokens, hierarchical pooling, and a novel attention-guided clustering (AGC). AGC uses an attention-guided mechanism to identify the most semantically salient regions of a document as cluster centroids and to weight token aggregation. Evaluating these methods on retrieval tasks spanning text (BEIR), visual-document (ViDoRe), and video (MSR-VTT, MultiVENT 2.0), we show that attention-guided clustering consistently outperforms other parameterized compression methods (sequence resizing and memory tokens), provides greater flexibility in index size than non-parametric hierarchical clustering, and achieves competitive or improved performance compared to a full, uncompressed index. The source code is available at: github.com/hanxiangqin/omni-col-press.
IR | Aug 29, 2024
Jina-ColBERT-v2: A General-Purpose Multilingual Late Interaction Retriever
Rohan Jha, Bo Wang, Michael Günther et al.
Multi-vector dense models, such as ColBERT, have proven highly effective in information retrieval. ColBERT's late interaction scoring approximates the joint query-document attention seen in cross-encoders while maintaining inference efficiency closer to traditional dense retrieval models, thanks to its bi-encoder architecture and recent optimizations in indexing and search. In this work we propose a number of incremental improvements to the ColBERT model architecture and training pipeline, using methods shown to work in the more mature single-vector embedding model training paradigm, particularly those that apply to heterogeneous multilingual data or boost efficiency with little tradeoff. Our new model, Jina-ColBERT-v2, demonstrates strong performance across a range of English and multilingual retrieval tasks.
75.8 | IR | Mar 23
A Brief Comparison of Training-Free Multi-Vector Sequence Compression Methods
Rohan Jha, Chunsheng Zuo, Reno Kriz et al.
While multi-vector retrieval models outperform single-vector models of comparable size in retrieval quality, their practicality is limited by substantially larger index sizes, driven by the additional sequence-length dimension in their document embeddings. Because document embedding size dictates both memory overhead and query latency, compression is essential for deployment. In this work, we present an evaluation of training-free methods targeting the token sequence length, a dimension unique to multi-vector retrieval. Our findings suggest that token merging is strictly superior to token pruning for reducing index size while maintaining retrieval effectiveness.
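To make the pruning versus merging distinction concrete, here is a toy sketch of two training-free ways to shrink a document's token sequence to a fixed budget: pruning keeps a subset of the original vectors, while merging averages contiguous groups so every token contributes. Both strategies (evenly spaced selection, contiguous mean pooling) and all names are assumptions chosen for illustration; the paper's actual methods may differ.

```python
from typing import List

Matrix = List[List[float]]


def prune_tokens(doc_vecs: Matrix, budget: int) -> Matrix:
    """Keep `budget` vectors at evenly spaced positions; the rest are dropped."""
    n = len(doc_vecs)
    if budget >= n:
        return [list(v) for v in doc_vecs]
    return [list(doc_vecs[i * n // budget]) for i in range(budget)]


def merge_tokens(doc_vecs: Matrix, budget: int) -> Matrix:
    """Average contiguous groups of vectors so that `budget` vectors remain."""
    n = len(doc_vecs)
    if budget >= n:
        return [list(v) for v in doc_vecs]
    merged = []
    for g in range(budget):
        group = doc_vecs[g * n // budget:(g + 1) * n // budget]
        dim = len(group[0])
        merged.append([sum(v[j] for v in group) / len(group) for j in range(dim)])
    return merged


# Four hypothetical token embeddings compressed to a budget of two.
doc = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 0.0]]
pruned = prune_tokens(doc, 2)
merged = merge_tokens(doc, 2)
```

In this toy case pruning discards the second and fourth tokens entirely, whereas merging retains a blended trace of all four.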
58.3 | IR | May 1
A Replicability Study of XTR
Rohan Jha, Reno Kriz, Benjamin Van Durme
The XTR (conteXtual Token Retrieval) algorithm is a modification to ColBERT retrieval that avoids the costly step of fully gathering and reranking the candidates' embeddings by imputing their missing similarity scores from the initial token retrieval step. The original work proposes a modified training objective as necessary for effective XTR retrieval, arguing that standard ColBERT token scoring is unsuitable for imputation. In this paper, we replicate both the XTR retrieval algorithm and its modified training objective, and extend the evaluation to knowledge-distillation (KD) training and efficient retrieval engines (PLAID and WARP). We confirm the token-level matching characteristics claimed in the original work, but fail to replicate XTR's overall effectiveness advantage over ColBERT under a controlled comparison. We further show that XTR's training modification has a concrete mechanistic consequence for modern retrieval engines: by flattening ColBERT's characteristically peaked token score distribution, XTR training yields more discriminative centroid scores and thus more efficient IVF-based retrieval under PLAID and WARP. The utility of XTR training is therefore not limited to the low-$k'$ regime originally studied, but extends to any deployment setting where IVF-based engines are used. These findings offer practitioners concrete guidance on how and when to use XTR as their multi-vector retriever.
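The imputation idea described in this abstract can be sketched as follows: score each candidate document using only the token hits returned by the initial retrieval step, and fill in any query token that has no hit for that document with a fallback value. This sketch assumes the fallback is the lowest retrieved score for that query token and that the total is averaged over query tokens; both choices, and all names, are illustrative assumptions and not the replicated implementation.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Hit = Tuple[str, float]  # (doc_id, similarity of one retrieved document token)


def xtr_style_scores(retrieved: List[List[Hit]]) -> Dict[str, float]:
    """Score documents from token hits only, imputing missing similarities.

    retrieved[i] holds the top-k' hits for query token i (assumed non-empty).
    """
    n_query = len(retrieved)
    best: Dict[str, Dict[int, float]] = defaultdict(dict)
    floors: List[float] = []
    for i, hits in enumerate(retrieved):
        # Assumed fallback: the weakest score this query token retrieved.
        floors.append(min(score for _, score in hits))
        for doc_id, score in hits:
            if score > best[doc_id].get(i, float("-inf")):
                best[doc_id][i] = score
    return {
        doc_id: sum(per_q.get(i, floors[i]) for i in range(n_query)) / n_query
        for doc_id, per_q in best.items()
    }


# Two query tokens, each with two retrieved document-token hits (toy values).
retrieved = [
    [("d1", 0.9), ("d2", 0.7)],
    [("d2", 0.9), ("d3", 0.6)],
]
scores = xtr_style_scores(retrieved)
```

No document embeddings are gathered or decompressed here, which is the step the abstract says XTR avoids.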
77.4 | LG | Apr 30
Diagnosing Capability Gaps in Fine-Tuning Data
Saeid Asgari Taghanaki, Rakshanda Agarwal, Bruce Sun et al.
Fine-tuning large language models (LLMs) for domain-specific tasks requires training datasets that comprehensively cover the target capabilities a practitioner needs. Yet identifying which capabilities a dataset fails to support, and doing so before an expensive fine-tuning run, remains a largely unsolved problem. We introduce GoalCover, a framework that helps practitioners systematically detect capability gaps in fine-tuning datasets through interactive goal decomposition and automated coverage assessment. GoalCover guides a practitioner through structured decomposition of a high-level goal into atomic, independently evaluable subgoals; assigns each training sample an LLM-based alignment score against every subgoal; and surfaces missing capabilities through automated analysis of low-scoring sample explanations. We validate the framework along two complementary axes. First, through controlled corruption experiments across three domains (medical QA, legal summarization, code generation), we show that GoalCover reliably distinguishes targeted from non-targeted capability impacts: target subgoals degrade by 25.6% on average versus 2.1% for non-target subgoals (Cohen's d=1.24). Second, we demonstrate downstream utility on a financial-summarization Reinforcement Fine-Tuning (RFT) task with Qwen-3-14B: training on GoalCover-filtered data improves the LLM-judge reward from 3.77 to 4.12 (out of 5) over the unfiltered baseline, and combining filtered data with goal-conditioned synthetic samples yields the strongest result (4.20). The two results together show that GoalCover works as a practical pre-fine-tuning diagnostic: it detects capability gaps and produces concrete signal for closing them.
CL | Apr 30, 2020
Does Data Augmentation Improve Generalization in NLP?
Rohan Jha, Charles Lovering, Ellie Pavlick
Neural models often exploit superficial features to achieve good performance, rather than deriving more general features. Overcoming this tendency is a central challenge in areas such as representation learning and ML fairness. Recent work has proposed using data augmentation, i.e., generating training examples where the superficial features fail, as a means of encouraging models to prefer the stronger features. We design a series of toy learning problems to test the hypothesis that data augmentation leads models to unlearn weaker heuristics, but not to learn stronger features in their place. We find partial support for this hypothesis: Data augmentation often hurts before it helps, and it is less effective when the preferred strong feature is much more difficult to extract than the competing weak feature.