SD · AI · LG · AS · Nov 4, 2023

Generalized zero-shot audio-to-intent classification

MILA
arXiv:2311.02482v1 · 4 citations · h-index: 17
AI Analysis

This paper addresses the challenge of handling unseen intents in audio-only spoken language understanding systems. The contribution is incremental: it builds on existing methods and adds multimodal enhancements.

The study addresses the limited ability of audio-only spoken language understanding systems to handle unseen intents. It proposes a generalized zero-shot audio-to-intent classification framework that needs only a few sample text sentences per intent. Compared to audio-only training, the multimodal training strategy improves zero-shot accuracy on unseen intents by 2.75% on SLURP and by 18.2% on an internal goal-oriented dialog dataset.

Spoken language understanding systems using audio-only data are gaining popularity, yet their ability to handle unseen intents remains limited. In this study, we propose a generalized zero-shot audio-to-intent classification framework with only a few sample text sentences per intent. To achieve this, we first train a supervised audio-to-intent classifier by making use of a self-supervised pre-trained model. We then leverage a neural audio synthesizer to create audio embeddings for sample text utterances and perform generalized zero-shot classification on unseen intents using cosine similarity. We also propose a multimodal training strategy that incorporates lexical information into the audio representation to improve zero-shot performance. Compared to audio-only training, our multimodal training approach improves the accuracy of zero-shot intent classification on unseen intents by 2.75% on SLURP and by 18.2% on an internal goal-oriented dialog dataset.
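
The cosine-similarity step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The embeddings are hand-written toy vectors standing in for the outputs of the audio encoder and the neural audio synthesizer, the function names `cosine_similarity` and `classify_intent` are hypothetical, and scoring each intent by its best-matching sample is an assumption (the abstract does not say how multiple samples per intent are combined).

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def classify_intent(query_embedding, intent_samples):
    """Return (intent, score) for the intent whose sample is closest to the query.

    Assumption: each intent is scored by its best-matching sample embedding.
    Every intent must have at least one sample.
    """
    best_intent, best_score = None, -math.inf
    for intent, samples in intent_samples.items():
        score = max(cosine_similarity(query_embedding, s) for s in samples)
        if score > best_score:
            best_intent, best_score = intent, score
    return best_intent, best_score


# Toy embeddings (assumed values). In the described framework these would be
# audio embeddings synthesized from a few sample text sentences per intent.
intent_samples = {
    "play_music": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],  # seen intent
    "set_alarm": [[0.1, 0.9, 0.0]],                    # seen intent
    "weather_query": [[0.0, 0.2, 0.9]],                # stands in for an unseen intent
}

# Toy embedding of an incoming spoken utterance.
query = [0.1, 0.1, 0.8]

predicted, score = classify_intent(query, intent_samples)
print(predicted, round(score, 3))
```

Because seen and unseen intents share one candidate set, the same nearest-neighbour lookup covers the generalized zero-shot setting: adding a new intent only requires adding its sample embeddings, not retraining the classifier.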

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
