AS CL SD

Sep 24, 2018

From Audio to Semantics: Approaches to end-to-end spoken language understanding

Parisa Haghani, Arun Narayanan, Michiel Bacchiani, Galen Chuang, Neeraj Gaur, Pedro Moreno, Rohit Prabhavalkar, Zhongdi Qu, Austin Waters

arXiv:1809.09190v1 26.7159 citations

Originality: Incremental advance

AI Analysis

This work aims to improve the accuracy of spoken language understanding systems used in applications such as voice assistants. The advance is incremental, since it builds on existing encoder-decoder methods.

The paper proposes an end-to-end sequence-to-sequence approach to spoken language understanding that jointly optimizes the speech recognition and natural language understanding modules, yielding an 18% relative improvement in argument word error rate over independently trained models.

Conventional spoken language understanding systems consist of two main components: an automatic speech recognition module that converts audio to a transcript, and a natural language understanding module that transforms the resulting text (or top N hypotheses) into a set of domains, intents, and arguments. These modules are typically optimized independently. In this paper, we formulate audio to semantic understanding as a sequence-to-sequence problem [1]. We propose and compare various encoder-decoder based approaches that optimize both modules jointly, in an end-to-end manner. Evaluations on a real-world task show that 1) having an intermediate text representation is crucial for the quality of the predicted semantics, especially the intent arguments and 2) jointly optimizing the full system improves overall accuracy of prediction. Compared to independently trained models, our best jointly trained model achieves similar domain and intent prediction F1 scores, but improves argument word error rate by 18% relative.
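To make the "18% relative" figure concrete, here is a minimal Python sketch of a standard word error rate (word-level edit distance divided by reference length) and of a relative improvement between two error rates. The abstract does not define how argument word error rate is computed, so applying the standard definition to argument strings is an assumption. The function names and the example error rates (0.10 and 0.082) are hypothetical and are not results from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j]: edit distance between the reference words seen so far
    # (initially none) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_improvement(baseline_error: float, new_error: float) -> float:
    """Fraction by which new_error is lower than baseline_error."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - new_error) / baseline_error


if __name__ == "__main__":
    # Hypothetical argument strings: one substituted word out of five.
    print(word_error_rate("set an alarm for seven", "set an alarm for eleven"))
    # Hypothetical error rates chosen only to illustrate the arithmetic.
    print(round(relative_improvement(0.10, 0.082), 2))
```

A relative improvement is measured against the baseline error rather than in absolute points: in the hypothetical case above, a drop from 0.10 to 0.082 is 1.8 absolute points but an 18% relative reduction.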
