AI · Dec 16, 2022

Speech Aware Dialog System Technology Challenge (DSTC11)

Hagen Soltau, Izhak Shafran, Mingqiu Wang, Abhinav Rastogi, Jeffrey Zhao, Ye Jia, Wei Han, Yuan Cao, Aramys Miranda

DeepMind

arXiv:2212.08704v1 · 14.712 citations · h-index: 34

Originality: Synthesis-oriented

AI Analysis

This work addresses the lack of public corpora for speech-aware dialog systems, enabling research on bridging the gap between written and spoken input in practical applications. It is incremental, however, in that it builds on existing written-domain tasks.

The paper tackles task-oriented dialog with speech input by creating a public corpus containing three spoken versions of the MultiWoz task (TTS-Verbatim, Human-Verbatim, and Human-paraphrased). The corpus is intended for investigating the performance gap between written and spoken input and for assessing TTS as a surrogate for human data collection.

Most research on task-oriented dialog modeling is based on written text input. However, users often interact with practical dialog systems using speech as input. Typically, systems convert speech into text using an Automatic Speech Recognition (ASR) system, introducing errors. Furthermore, these systems do not address the differences between written and spoken language. Research on this topic is stymied by the lack of a public corpus. Motivated by these considerations, our goal in hosting the speech-aware dialog state tracking challenge was to create a public corpus or task that can be used to investigate the performance gap between the written and spoken forms of input, develop models that could alleviate this gap, and establish whether Text-to-Speech (TTS) systems are a reasonable surrogate for the more labor-intensive human data collection. We created three spoken versions of the popular written-domain MultiWoz task: (a) TTS-Verbatim: written user inputs were converted into speech waveforms using a TTS system, (b) Human-Verbatim: humans spoke the user inputs verbatim, and (c) Human-paraphrased: humans paraphrased the user inputs. Additionally, we provided different forms of ASR output to encourage wider participation from teams that may not have access to state-of-the-art ASR systems. These included ASR transcripts, word time stamps, and latent representations of the audio (audio encoder outputs). In this paper, we describe the corpus, report results from participating teams, provide preliminary analyses of their results, and summarize the current state of the art in this domain.
