LG · HC · May 20, 2023

Efficient Multimodal Neural Networks for Trigger-less Voice Assistants

arXiv:2305.12063v1 · 5 citations
Originality: Incremental advance
AI Analysis

This work addresses the need for more efficient and deployable multimodal systems for voice assistants on low-power devices like smartwatches, representing an incremental improvement over existing methods.

The paper tackles trigger-less voice assistant invocation on smartwatches by proposing a neural-network-based audio-gesture multimodal fusion system that improves adaptability and scalability and reduces human biases compared to heuristic-based methods.

The adoption of multimodal interactions by Voice Assistants (VAs) is growing rapidly to enhance human-computer interactions. Smartwatches have now incorporated trigger-less methods of invoking VAs, such as Raise To Speak (RTS), where the user raises their watch and speaks to the VA without an explicit trigger. Current state-of-the-art RTS systems rely on heuristics and engineered Finite State Machines to fuse gesture and audio data for multimodal decision-making. However, these methods suffer from limited adaptability, limited scalability, and induced human biases. In this work, we propose a neural-network-based audio-gesture multimodal fusion system that (1) better understands the temporal correlation between audio and gesture data, leading to precise invocations; (2) generalizes to a wide range of environments and scenarios; (3) is lightweight and deployable on low-power devices, such as smartwatches, with quick launch times; and (4) improves productivity in asset development processes.
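To make the contrast in the abstract concrete, the sketch below sets a hand-written, state-machine-style fusion rule next to a learned fusion score. It is an illustration only and not the paper's architecture: the function names, thresholds, window layout, and the single logistic unit (used here as the smallest possible stand-in for a neural network) are all assumptions introduced for this example.

```python
import math
from typing import Optional, Sequence


def heuristic_fusion(
    gesture_scores: Sequence[float],
    audio_scores: Sequence[float],
    gesture_thresh: float = 0.8,   # hypothetical hand-tuned threshold
    audio_thresh: float = 0.6,     # hypothetical hand-tuned threshold
    max_gap: int = 3,              # hypothetical: frames speech may lag the raise
) -> bool:
    """State-machine-style rule: invoke if speech is detected in the same
    frame as a raise gesture or within max_gap frames after it."""
    frames_since_raise: Optional[int] = None  # None means "no active raise"
    for g, a in zip(gesture_scores, audio_scores):
        if g >= gesture_thresh:
            frames_since_raise = 0
        elif frames_since_raise is not None:
            frames_since_raise += 1
            if frames_since_raise > max_gap:
                frames_since_raise = None  # the raise has expired
        if frames_since_raise is not None and a >= audio_thresh:
            return True
    return False


def learned_fusion(
    gesture_scores: Sequence[float],
    audio_scores: Sequence[float],
    weights: Sequence[float],
    bias: float,
) -> float:
    """Toy learned fusion: one logistic unit over the concatenated window
    of both modalities. The weights would come from training, so no timing
    rule is written by hand. Returns an invocation probability."""
    features = list(gesture_scores) + list(audio_scores)
    if len(features) != len(weights):
        raise ValueError("weights must match the concatenated window length")
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))
```

In the heuristic version every threshold and timing constant is a human decision that must be re-tuned for each new scenario. In the learned version the same dependencies are carried by parameters fitted to data, which is the kind of adaptability the abstract attributes to a neural fusion model.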
