Efficient Multimodal Neural Networks for Trigger-less Voice Assistants
This work addresses the need for more efficient and deployable multimodal systems for voice assistants on low-power devices like smartwatches, representing an incremental improvement over existing methods.
The paper tackles the problem of trigger-less voice assistant invocation on smartwatches by proposing a neural network-based audio-gesture multimodal fusion system that improves adaptability and scalability and reduces human bias compared to heuristic-based methods.
The adoption of multimodal interactions by Voice Assistants (VAs) is growing rapidly to enhance human-computer interactions. Smartwatches have now incorporated trigger-less methods of invoking VAs, such as Raise To Speak (RTS), where the user raises their watch and speaks to the VA without an explicit trigger. Current state-of-the-art RTS systems rely on heuristics and engineered Finite State Machines to fuse gesture and audio data for multimodal decision-making. However, these methods have limitations, including limited adaptability, limited scalability, and induced human biases. In this work, we propose a neural network-based audio-gesture multimodal fusion system that (1) better understands the temporal correlation between audio and gesture data, leading to precise invocations; (2) generalizes to a wide range of environments and scenarios; (3) is lightweight and deployable on low-power devices, such as smartwatches, with quick launch times; and (4) improves productivity in asset development processes.
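To make the idea of fusing audio and gesture data concrete, the following is a minimal sketch of a fusion scorer in pure Python. It is not the architecture proposed in the paper, which the abstract does not specify: the function names, the one-dimensional features, the weights, the bias, and the 0.5 threshold are all hypothetical, and in a learned system the weights would come from training rather than being written by hand. The sketch only illustrates the general pattern of combining time-aligned frames from both modalities into a single invocation decision instead of hand-coding state transitions.

```python
import math
from typing import List, Sequence

Vector = List[float]


def sigmoid(x: float) -> float:
    """Logistic function mapping a real value to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def fuse_frames(audio: Sequence[Vector], gesture: Sequence[Vector]) -> List[Vector]:
    """Concatenate time-aligned audio and gesture feature frames (hypothetical helper)."""
    if len(audio) != len(gesture):
        raise ValueError("audio and gesture streams must have the same number of frames")
    return [list(a) + list(g) for a, g in zip(audio, gesture)]


def invocation_score(frames: Sequence[Vector], weights: Vector, bias: float) -> float:
    """Apply one linear layer plus sigmoid to each fused frame, then average over time.

    The weights stand in for learned parameters; they are assumptions for illustration.
    """
    if not frames:
        raise ValueError("need at least one frame")
    if any(len(frame) != len(weights) for frame in frames):
        raise ValueError("each fused frame must match the weight vector length")
    per_frame = [
        sigmoid(sum(w * x for w, x in zip(weights, frame)) + bias)
        for frame in frames
    ]
    return sum(per_frame) / len(per_frame)


def should_invoke(score: float, threshold: float = 0.5) -> bool:
    """Decide whether to launch the assistant (the threshold is an assumed value)."""
    return score >= threshold


# Hypothetical example: one audio feature (speech likelihood) and one gesture
# feature (raise likelihood) per frame, over a two-frame window.
WEIGHTS = [2.0, 2.0]
BIAS = -2.0

speech_and_raise = fuse_frames([[0.9], [0.8]], [[1.0], [0.7]])
raise_only = fuse_frames([[0.0], [0.0]], [[1.0], [0.7]])

invoke_with_speech = should_invoke(invocation_score(speech_and_raise, WEIGHTS, BIAS))
invoke_without_speech = should_invoke(invocation_score(raise_only, WEIGHTS, BIAS))
```

With these assumed weights, a raise gesture accompanied by speech crosses the threshold, while a raise gesture alone does not. A per-frame linear layer like this does not model temporal correlation between the two streams; capturing that would require a sequence model, which is beyond what this sketch attempts.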