SD LG AS Oct 9, 2021

Streaming on-device detection of device directed speech from voice and touch-based invocation

arXiv:2110.04656v1
9 citations
Originality: Incremental advance
AI Analysis

This work addresses user experience and privacy issues for smart device users by mitigating false triggers, though it is incremental as it builds on existing temporal convolutional networks.

The paper tackles accidental virtual assistant invocations by proposing an acoustic false-trigger-mitigation approach for on-device detection of device-directed speech that handles both voice and touch-based invocations. It reports that the streaming TCN model reduces runtime peak memory by 33% relative to a non-streaming counterpart, versus 7% for the LSTM-based approach, while performing better than or comparably to the alternatives.

When interacting with smart devices such as mobile phones or wearables, the user typically invokes a virtual assistant (VA) by saying a keyword or by pressing a button on the device. However, in many cases the VA can be invoked accidentally by keyword-like speech or an accidental button press, which may have implications for user experience and privacy. To this end, we propose an acoustic false-trigger-mitigation (FTM) approach for on-device device-directed speech detection that simultaneously handles voice-trigger and touch-based invocations. To facilitate on-device model deployment, we introduce a new streaming decision layer, derived using the notion of temporal convolutional networks (TCN) [1], which are known for their computational efficiency. To the best of our knowledge, this is the first approach that can detect device-directed speech from more than one invocation type in a streaming fashion. We compare this approach with streaming alternatives based on a vanilla Average layer and canonical LSTMs, and show: (i) that all the models show only a small degradation in accuracy compared with the invocation-specific models, and (ii) that the newly introduced streaming TCN consistently performs better than or comparably to the alternatives, while mitigating device-undirected speech earlier and achieving a relative reduction in runtime peak memory of 33%, versus 7% for the LSTM-based approach, when each is compared to a non-streaming counterpart.
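The abstract does not spell out the streaming decision layer, so the following is only a minimal sketch of the general idea behind streaming a TCN-style layer: a causal dilated 1-D convolution can be evaluated one frame at a time from a fixed-size history buffer, which is what bounds memory at runtime. The names `causal_dilated_conv` and `StreamingCausalConv`, the scalar frames, and the example weights are all hypothetical and are not taken from the paper.

```python
from collections import deque
from typing import List, Sequence


def causal_dilated_conv(x: Sequence[float], weights: Sequence[float],
                        dilation: int) -> List[float]:
    """Offline reference: y[t] = sum_k weights[k] * x[t - k * dilation].

    Frames before t = 0 are treated as zeros (causal zero padding).
    """
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation
            if idx >= 0:
                acc += w * x[idx]
        out.append(acc)
    return out


class StreamingCausalConv:
    """The same layer evaluated frame by frame (illustrative, not the paper's layer)."""

    def __init__(self, weights: Sequence[float], dilation: int) -> None:
        self.weights = list(weights)
        self.dilation = dilation
        # Only the receptive field is kept: (kernel_size - 1) * dilation + 1 frames.
        size = (len(self.weights) - 1) * dilation + 1
        self.buffer = deque([0.0] * size, maxlen=size)

    def step(self, frame: float) -> float:
        # Appending to a full deque with maxlen drops the oldest frame.
        self.buffer.append(frame)
        # buffer[-1] is the current frame; buffer[-1 - k * dilation] is k taps back.
        return sum(w * self.buffer[-1 - k * self.dilation]
                   for k, w in enumerate(self.weights))


if __name__ == "__main__":
    frames = [0.5, -1.0, 2.0, 0.25, 1.5, -0.75, 3.0]
    layer = StreamingCausalConv([0.5, 0.25, -1.0], dilation=2)
    streamed = [layer.step(f) for f in frames]
    print(streamed)
```

Under these assumptions the streamed outputs match the offline computation frame for frame, while the state held between frames is a buffer whose length depends only on kernel size and dilation, not on utterance length. A real model would stack several such layers over feature vectors rather than scalars.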

