CL · SD · AS · May 20, 2021

A Streaming End-to-End Framework For Spoken Language Understanding

arXiv:2105.10042v4 · 11 citations
Originality: Incremental advance
AI Analysis

This work addresses a limitation of existing end-to-end SLU systems, which handle only one intention at a time, so users of a dialogue system need fewer rounds to complete a request. The contribution is incremental because it builds on existing CTC-based methods.

The paper tackles the problem of processing multiple user intentions in spoken language understanding by proposing a streaming end-to-end framework that identifies intentions sequentially in an online and incremental manner, achieving about 97% intent detection accuracy on multi-intent settings in the Fluent Speech Commands dataset.

End-to-end spoken language understanding (SLU) has recently attracted increasing interest. Compared to the conventional tandem-based approach that combines speech recognition and language understanding as separate modules, the new approach extracts users' intentions directly from the speech signals, resulting in joint optimization and low latency. Such an approach, however, is typically designed to process one intention at a time, which forces users to take multiple rounds to fulfill their requirements when interacting with a dialogue system. In this paper, we propose a streaming end-to-end framework that can process multiple intentions in an online and incremental way. The backbone of our framework is a unidirectional RNN trained with the connectionist temporal classification (CTC) criterion. With this design, an intention can be identified as soon as sufficient evidence has been accumulated, and multiple intentions can be identified sequentially. We evaluate our solution on the Fluent Speech Commands (FSC) dataset, where the intent detection accuracy is about 97% on all multi-intent settings. This result is comparable to the performance of state-of-the-art non-streaming models, but is achieved in an online and incremental way. We also apply our model to a keyword spotting task using the Google Speech Commands dataset, and the results are also highly promising.
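To make the streaming idea concrete, here is a minimal sketch of how a CTC-trained model's per-frame outputs could be decoded online. It is not the authors' implementation. It assumes a greedy rule (emit an intent the first time a non-blank label becomes the most likely symbol), that index 0 is the CTC blank, and that the model's frame-level posteriors are already available. The function name `stream_ctc_greedy` and the toy posteriors are hypothetical.

```python
from typing import Iterable, Iterator, Sequence, Tuple

BLANK = 0  # assumption: index 0 is the CTC blank symbol


def stream_ctc_greedy(
    frame_posteriors: Iterable[Sequence[float]],
    blank: int = BLANK,
) -> Iterator[Tuple[int, int]]:
    """Yield (frame_index, label) as soon as a new non-blank label appears.

    This applies the usual CTC collapse rule (merge repeats, drop blanks)
    frame by frame, so no future frames are needed to emit a label.
    """
    prev = blank
    for t, probs in enumerate(frame_posteriors):
        # Most likely symbol at this frame.
        best = max(range(len(probs)), key=probs.__getitem__)
        # Emit only when a non-blank label starts (not while it repeats).
        if best != blank and best != prev:
            yield t, best
        prev = best


# Toy posteriors over {blank, intent 1, intent 2}; the values are made up.
toy_frames = [
    [0.90, 0.05, 0.05],
    [0.20, 0.70, 0.10],
    [0.10, 0.80, 0.10],
    [0.80, 0.10, 0.10],
    [0.10, 0.10, 0.80],
    [0.90, 0.05, 0.05],
]
emitted = list(stream_ctc_greedy(toy_frames))
print(emitted)  # [(1, 1), (4, 2)]
```

In this toy run, intent 1 is emitted at frame 1 and intent 2 at frame 4, each before the utterance ends. A unidirectional RNN suits this setup because each frame's posterior depends only on past input. A real system would likely add a confidence threshold before emitting, which this sketch omits.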
