SDCL Nov 12, 2025

End-to-end Contrastive Language-Speech Pretraining Model For Long-form Spoken Question Answering

arXiv:2511.09282v1
1 citation
h-index: 5
Originality: Incremental advance
AI Analysis

This paper addresses long-form spoken question answering for applications that require efficient audio processing. It is an incremental improvement over existing retrieval methods.

The paper tackles the challenge of processing long audio in spoken question answering by proposing CLSR, an end-to-end contrastive language-speech retriever that extracts question-relevant segments, and it surpasses existing methods across four cross-modal retrieval datasets.

Significant progress has been made in spoken question answering (SQA) in recent years. However, many existing methods, including large audio language models, struggle with processing long audio. Following the success of retrieval-augmented generation, speech-related retrievers show promise for preprocessing long-form speech, but the performance of existing retrievers is lacking. To address this challenge, we propose CLSR, an end-to-end contrastive language-speech retriever that efficiently extracts question-relevant segments from long audio recordings for downstream SQA tasks. Unlike conventional speech-text contrastive models, CLSR incorporates an intermediate step that converts acoustic features into text-like representations prior to alignment, thereby bridging the gap between modalities more effectively. Experimental results across four cross-modal retrieval datasets demonstrate that CLSR surpasses both end-to-end speech-related retrievers and pipeline approaches that combine speech recognition with text retrieval, providing a robust foundation for advancing practical long-form SQA applications.
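
The abstract does not spell out the training objective or the retrieval procedure, so the following is only a minimal sketch of how a contrastive retriever of this general kind is commonly set up. It assumes a CLIP-style symmetric InfoNCE loss over matched question/segment embeddings and cosine-similarity ranking at query time. The function names (`symmetric_info_nce`, `retrieve_top_k`), the temperature value, and the toy embeddings are all hypothetical, and the sketch does not model CLSR's intermediate conversion of acoustic features into text-like representations.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit length."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def symmetric_info_nce(question_emb, segment_emb, temperature=0.07):
    """Assumed objective: symmetric InfoNCE, where row i of each matrix is a matched pair."""
    q = l2_normalize(question_emb)
    s = l2_normalize(segment_emb)
    logits = q @ s.T / temperature  # pairwise cosine similarities, scaled

    def diagonal_cross_entropy(z):
        # Numerically stable log-softmax over each row; the target is the diagonal.
        z = z - z.max(axis=1, keepdims=True)
        log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the question-to-segment and segment-to-question directions.
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits.T))


def retrieve_top_k(question_emb, segment_embs, k=2):
    """Rank audio-segment embeddings by cosine similarity to one question embedding."""
    q = l2_normalize(question_emb)
    s = l2_normalize(segment_embs)
    scores = s @ q
    order = np.argsort(-scores)[:k]
    return order, scores[order]


# Toy data (hypothetical): four segment embeddings and one question embedding.
segments = np.eye(4)
question = np.array([0.1, 0.0, 0.9, 0.2])
top_idx, top_scores = retrieve_top_k(question, segments, k=2)
```

In a long-form SQA setting, the segments selected by `retrieve_top_k` would be the only audio passed on to the downstream answering model, which is what keeps the input short.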

