AS · AI · CL · Mar 29, 2022

WAVPROMPT: Towards Few-Shot Spoken Language Understanding with Frozen Language Models

arXiv:2203.15863v2 · 37 citations · Has Code
Originality: Incremental advance
AI Analysis

This work addresses the challenge of applying few-shot learning to audio-text tasks, offering a novel framework for speech understanding with potential applications in domains like voice assistants, though it is incremental in extending existing methods to a new modality.

The authors tackled the problem of enabling few-shot learning for spoken language understanding by adapting frozen language models to process audio inputs, achieving better performance than a naive text baseline.

Large-scale auto-regressive language models pretrained on massive text have demonstrated their impressive ability to perform new natural language tasks with only a few text examples, without the need for fine-tuning. Recent studies further show that such a few-shot learning ability can be extended to the text-image setting by training an encoder to encode the images into embeddings functioning like the text embeddings of the language model. Interested in exploring the possibility of transferring the few-shot learning ability to the audio-text setting, we propose a novel speech understanding framework, WavPrompt, where we finetune a wav2vec model to generate a sequence of audio embeddings understood by the language model. We show that WavPrompt is a few-shot learner that can perform speech understanding tasks better than a naive text baseline. We conduct detailed ablation studies on different components and hyperparameters to empirically identify the best model configuration. In addition, we conduct a non-speech understanding experiment to show WavPrompt can extract more information than just the transcriptions. Code is available at https://github.com/Hertin/WavPrompt
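The abstract describes an audio encoder whose output embeddings are placed alongside the text embeddings of a frozen language model, so that a few demonstrations plus a query form a single input sequence. The following is a minimal sketch of how such a few-shot prompt might be assembled. It uses toy stand-ins rather than wav2vec or a real language model, and the names `toy_audio_encoder`, `toy_text_embedder`, and `build_few_shot_prompt`, the embedding size, and the audio-question-answer ordering are all illustrative assumptions, not the authors' implementation.

```python
from typing import List, Sequence, Tuple

EMBED_DIM = 4  # toy size (assumption); real language models use far larger embeddings

Vector = List[float]
# A demonstration is (waveform, question text, answer text); this layout is an assumption.
Demo = Tuple[Sequence[float], str, str]


def toy_audio_encoder(waveform: Sequence[float], frame_size: int = 4) -> List[Vector]:
    """Hypothetical stand-in for the trainable audio encoder.

    Splits the waveform into frames and emits one EMBED_DIM vector per frame,
    so the output is a sequence of embeddings like a sequence of text tokens.
    """
    frames = [waveform[i:i + frame_size] for i in range(0, len(waveform), frame_size)]
    return [[sum(frame) / len(frame)] * EMBED_DIM for frame in frames]


def toy_text_embedder(text: str) -> List[Vector]:
    """Hypothetical stand-in for the frozen model's embedding lookup.

    Deterministic and parameter-free here, mirroring the idea that the
    language model's own weights are never updated.
    """
    return [[float(sum(map(ord, token)) % 7)] * EMBED_DIM for token in text.split()]


def build_few_shot_prompt(
    demos: Sequence[Demo], query_waveform: Sequence[float], question: str
) -> List[Vector]:
    """Concatenate demonstrations and the query into one embedding sequence."""
    prompt: List[Vector] = []
    for waveform, demo_question, answer in demos:
        prompt += toy_audio_encoder(waveform)    # audio embeddings
        prompt += toy_text_embedder(demo_question)  # question text embeddings
        prompt += toy_text_embedder(answer)      # answer text embeddings
    # The query has audio and question only; the language model would
    # generate the answer as a continuation of this sequence.
    prompt += toy_audio_encoder(query_waveform)
    prompt += toy_text_embedder(question)
    return prompt


question = "what did the speaker say"
demos = [([0.1] * 8, question, "hello")]
prompt = build_few_shot_prompt(demos, [0.2] * 12, question)
print(len(prompt), "embedding vectors of size", len(prompt[0]))
```

In a real system the resulting sequence would be passed to the frozen language model as input embeddings, and only the audio encoder would receive gradient updates during training.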

Code Implementations: 1 repo
