Enhancing Speech-to-Speech Dialogue Modeling with End-to-End Retrieval-Augmented Generation
This work addresses knowledge integration in speech-to-speech dialogue systems, a capability that matters for applications such as virtual assistants. The contribution is incremental, however, because it adapts an existing text-based method to a new modality.
The paper tackles the problem of incorporating external knowledge into end-to-end speech-to-speech dialogue systems. It proposes a retrieval-augmented generation framework that retrieves textual knowledge directly from speech queries, yielding significant performance improvements and higher retrieval efficiency, though the system still lags behind state-of-the-art cascaded models.
End-to-end speech-to-speech (S2S) dialogue systems have recently garnered increasing research attention for their lower latency and more natural integration of nonverbal cues such as emotion and speaker identity. However, these systems face key challenges, particularly in incorporating external knowledge, a capability commonly addressed by Retrieval-Augmented Generation (RAG) in text-based large language models (LLMs). The core difficulty lies in the modality gap between input speech and retrieved textual knowledge, which hinders effective integration of information. To address this issue, we propose a novel end-to-end RAG framework that directly retrieves relevant textual knowledge from speech queries. Experimental results demonstrate that our method significantly improves the performance of end-to-end S2S dialogue systems while achieving higher retrieval efficiency. Although the overall performance still lags behind the SOTA cascaded models, our framework offers a promising direction for enhancing knowledge integration in end-to-end S2S systems. Our code and dataset are released.
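To make the idea of retrieving text directly from a speech query concrete, the following is a minimal sketch, not the authors' implementation. It assumes that a speech encoder and a text encoder (both hypothetical here) map their inputs into one shared embedding space, so that no intermediate transcription step is needed and retrieval reduces to a nearest-neighbor search. The vectors below are hand-written toy values standing in for encoder outputs, and the function names `cosine` and `retrieve_top_k` are illustrative.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k(
    query_emb: Sequence[float],
    doc_embs: Sequence[Sequence[float]],
    docs: Sequence[str],
    k: int = 2,
) -> List[Tuple[str, float]]:
    """Rank text passages by similarity to a speech-query embedding."""
    scored = [(doc, cosine(query_emb, emb)) for doc, emb in zip(docs, doc_embs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy knowledge base. In a real system these embeddings would come from a
# text encoder (assumed, not specified in the abstract).
docs = [
    "Paris is the capital of France.",
    "The Nile flows through Egypt.",
    "Mount Everest is in the Himalayas.",
]
doc_embs = [
    [0.9, 0.1, 0.0],
    [0.1, 0.9, 0.1],
    [0.0, 0.2, 0.9],
]

# Stand-in for the output of a speech encoder applied to a spoken question
# (assumed to live in the same space as the text embeddings).
speech_query_emb = [0.8, 0.2, 0.1]

top = retrieve_top_k(speech_query_emb, doc_embs, docs, k=2)
for passage, score in top:
    print(f"{score:.3f}  {passage}")
```

The retrieved passages would then be supplied to the S2S model as additional context. By contrast, a cascaded pipeline first transcribes the speech and then runs text-to-text retrieval.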