CLSDASMay 14, 2024

SpeechVerse: A Large-scale Generalizable Audio Language Model

Amazon
arXiv:2405.08295v380 citationsh-index: 18
Originality Incremental advance
AI Analysis

This work addresses the need for more versatile audio-language models that can handle diverse tasks through natural language instructions, representing a significant but incremental advance in multimodal AI.

The authors tackled the problem of limited generalizability in audio-language models by developing SpeechVerse, a multi-task training framework that achieved superior performance to task-specific baselines on 9 out of 11 speech processing tasks.

Large language models (LLMs) have shown incredible proficiency in performing tasks that require semantic understanding of natural language instructions. Recently, many works have further expanded this capability to perceive multimodal audio and text inputs, but their capabilities are often limited to specific fine-tuned tasks such as automatic speech recognition and translation. We therefore develop SpeechVerse, a robust multi-task training and curriculum learning framework that combines pre-trained speech and text foundation models via a small set of learnable parameters, while keeping the pre-trained models frozen during training. The models are instruction finetuned using continuous latent representations extracted from the speech foundation model to achieve optimal zero-shot performance on a diverse range of speech processing tasks using natural language instructions. We perform extensive benchmarking that includes comparing our model performance against traditional baselines across several datasets and tasks. Furthermore, we evaluate the model's capability for generalized instruction following by testing on out-of-domain datasets, novel prompts, and unseen tasks. Our empirical experiments reveal that our multi-task SpeechVerse model is even superior to conventional task-specific baselines on 9 out of the 11 tasks.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes