CL | SD | AS | Nov 27, 2025

Joint Speech and Text Training for LLM-Based End-to-End Spoken Dialogue State Tracking

arXiv:2511.22503v1
Originality: Incremental advance
AI Analysis

The paper addresses the data scarcity problem in spoken DST for dialogue systems, though the advance is incremental because it builds on existing methods.

The paper tackles the problem of end-to-end spoken dialogue state tracking (DST) by proposing joint training on spoken and textual data to improve cross-domain generalization, achieving good performance without requiring spoken training data from target domains.

End-to-end spoken dialogue state tracking (DST) is made difficult by the twin challenges of handling speech input and data scarcity. Recent work has proposed combining speech foundation encoders with large language models as a way to alleviate some of this difficulty. Although this approach has been shown to yield strong spoken DST models, achieving state-of-the-art performance in realistic multi-turn DST, it struggles to generalize across domains and requires annotated spoken DST training data for each domain of interest. Collecting such data for every target domain is both costly and difficult. Noting that textual DST data is more easily obtained for various domains, in this work we propose jointly training on available spoken DST data and written textual data from other domains as a way to achieve cross-domain generalization. We conduct experiments that show the efficacy of the proposed method in achieving good cross-domain DST performance without relying on spoken training data from the target domains.
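
The joint training idea in the abstract can be pictured as drawing each training batch from two pools: spoken DST examples from the source domains and text-only DST examples from other domains. The following is only a minimal sketch of such batch mixing, not the paper's implementation. The names `DSTExample`, `mixed_batches`, and the mixing probability `p_spoken` are hypothetical, and the abstract does not state how the two data sources are actually balanced.

```python
import random
from dataclasses import dataclass
from typing import Iterator, List, Optional


@dataclass
class DSTExample:
    """One training example (hypothetical schema, not from the paper)."""
    domain: str
    state: dict                       # target dialogue state, e.g. {"hotel-area": "north"}
    text: Optional[str] = None        # written dialogue history (text-only data)
    audio_path: Optional[str] = None  # speech input (spoken data)


def mixed_batches(
    spoken: List[DSTExample],
    text: List[DSTExample],
    batch_size: int,
    num_batches: int,
    p_spoken: float = 0.5,  # assumed mixing ratio; the source gives no value
    seed: int = 0,
) -> Iterator[List[DSTExample]]:
    """Yield batches in which each slot is filled from the spoken pool
    with probability p_spoken, and from the text pool otherwise."""
    rng = random.Random(seed)
    for _ in range(num_batches):
        yield [
            rng.choice(spoken) if rng.random() < p_spoken else rng.choice(text)
            for _ in range(batch_size)
        ]


# Toy pools: spoken data from one domain, text data from a different domain.
spoken_pool = [DSTExample("hotel", {"hotel-area": "north"}, audio_path="a.wav")]
text_pool = [DSTExample("train", {"train-day": "monday"}, text="I need a train on Monday.")]

batches = list(mixed_batches(spoken_pool, text_pool, batch_size=4, num_batches=3))
```

In a real system, each spoken example would pass through the speech encoder and each text example would be fed to the language model directly. That routing is omitted here because the abstract does not describe it.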
