SD · LG · AS · Dec 6, 2023

Multimodal Data and Resource Efficient Device-Directed Speech Detection with Large Foundation Models

arXiv:2312.03632v1 · 5 citations · h-index: 14
Originality: Incremental advance
AI Analysis

This paper addresses the problem of improving the user experience with virtual assistants by enabling device-directed speech detection without trigger phrases. The advance is incremental, as it builds on existing multimodal and efficient training methods.

The paper aims to make virtual assistant interactions more natural by eliminating trigger phrases. It uses a multimodal approach that combines ASR hypotheses and acoustic features as inputs to a large language model, and it achieves lower equal-error-rates than unimodal baselines with 80k or fewer training examples.
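Since results are reported as equal-error-rates, a minimal sketch of how an EER can be computed from detector scores may help. This is a generic illustration, not the paper's evaluation code: the function name `equal_error_rate`, the label convention (1 = device-directed, 0 = not directed), and the choice to sweep thresholds over the observed scores are all assumptions made here.

```python
def equal_error_rate(scores, labels):
    """Approximate the EER of a binary detector.

    scores: higher means "more likely device-directed" (assumed convention).
    labels: 1 for device-directed speech, 0 otherwise (assumed convention).
    Returns (eer, threshold) at the threshold where the false-accept rate
    and false-reject rate are closest to each other.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one example of each class")

    best_gap, best_eer, best_threshold = None, None, None
    # Sweep every distinct score as a candidate decision threshold.
    for threshold in sorted(set(scores)):
        # False accepts: non-directed speech scored at or above the threshold.
        far = sum(s >= threshold for s in negatives) / len(negatives)
        # False rejects: directed speech scored below the threshold.
        frr = sum(s < threshold for s in positives) / len(positives)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer, best_threshold = gap, (far + frr) / 2, threshold
    return best_eer, best_threshold


if __name__ == "__main__":
    eer, threshold = equal_error_rate([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
    print(f"EER={eer:.2f} at threshold={threshold}")
```

With a finite test set the two error rates rarely cross exactly, so the sketch reports their average at the closest point; production evaluations often interpolate along the ROC curve instead.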

Interactions with virtual assistants typically start with a trigger phrase followed by a command. In this work, we explore the possibility of making these interactions more natural by eliminating the need for a trigger phrase. Our goal is to determine whether a user addressed the virtual assistant based on signals obtained from the streaming audio recorded by the device microphone. We address this task by combining 1-best hypotheses and decoder signals from an automatic speech recognition system with acoustic representations from an audio encoder as input features to a large language model (LLM). In particular, we are interested in data- and resource-efficient systems that require only a small amount of training data and can operate in scenarios with only a single frozen LLM available on a device. For this reason, our model is trained on 80k or fewer examples of multimodal data using a combination of low-rank adaptation and prefix tuning. We compare the proposed system to unimodal baselines and show that the multimodal approach achieves lower equal-error-rates (EERs), while using only a fraction of the training data. We also show that low-dimensional specialized audio representations lead to lower EERs than high-dimensional general audio representations.
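To illustrate why low-rank adaptation suits a setting with a single frozen LLM, the following is a minimal sketch of the standard low-rank update applied to one linear layer. It is not the paper's implementation and it omits prefix tuning entirely. The names `lora_forward`, `w_frozen`, `lora_a`, `lora_b`, and `alpha`, along with the tiny dimensions, are assumptions chosen for illustration.

```python
def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(x, w_frozen, lora_a, lora_b, alpha=1.0):
    """Compute y = W x + (alpha / r) * B (A x).

    w_frozen: (d_out x d_in) pretrained weights, never updated.
    lora_a:   (r x d_in) trainable down-projection.
    lora_b:   (d_out x r) trainable up-projection.
    Only lora_a and lora_b would be trained; r is the adapter rank.
    """
    rank = len(lora_a)
    base = matvec(w_frozen, x)
    update = matvec(lora_b, matvec(lora_a, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


def trainable_fraction(d_in, d_out, rank):
    """Share of adapter parameters relative to the full weight matrix."""
    return rank * (d_in + d_out) / (d_in * d_out)


if __name__ == "__main__":
    w = [[1.0, 0.0], [0.0, 1.0]]   # frozen 2x2 identity (toy example)
    a = [[1.0, 1.0]]               # rank-1 down-projection
    b = [[1.0], [2.0]]             # rank-1 up-projection
    print(lora_forward([3.0, 4.0], w, a, b))
    print(trainable_fraction(d_in=4096, d_out=4096, rank=8))
```

Because the frozen weights are untouched, the same base model can in principle serve other tasks while only the small adapter matrices are stored per task.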
