LG HC May 20, 2023

Efficient Multimodal Neural Networks for Trigger-less Voice Assistants

Sai Srujana Buddi, Utkarsh Oggy Sarawgi, Tashweena Heeramun, Karan Sawnhey, Ed Yanosik, Saravana Rathinam, Saurabh Adya

arXiv:2305.12063v1 6.65 citations

Originality: Incremental advance

AI Analysis

This work addresses the need for more efficient and deployable multimodal systems for voice assistants on low-power devices like smartwatches, representing an incremental improvement over existing methods.

The paper tackles trigger-less voice assistant invocation on smartwatches by proposing a neural-network-based audio-gesture multimodal fusion system, which improves adaptability and scalability and reduces human biases compared with heuristic-based methods.

The adoption of multimodal interactions by Voice Assistants (VAs) is growing rapidly to enhance human-computer interactions. Smartwatches have now incorporated trigger-less methods of invoking VAs, such as Raise To Speak (RTS), where the user raises their watch and speaks to VAs without an explicit trigger. Current state-of-the-art RTS systems rely on heuristics and engineered Finite State Machines to fuse gesture and audio data for multimodal decision-making. However, these methods have limitations, including limited adaptability, limited scalability, and induced human biases. In this work, we propose a neural-network-based audio-gesture multimodal fusion system that (1) better understands temporal correlation between audio and gesture data, leading to precise invocations; (2) generalizes to a wide range of environments and scenarios; (3) is lightweight and deployable on low-power devices, such as smartwatches, with quick launch times; and (4) improves productivity in asset development processes.
