AI LG SD AS May 24, 2025

LiSTEN: Learning Soft Token Embeddings for Neural Audio LLMs

Pooneh Mousavi, Shubham Gupta, Cem Subakan, Mirco Ravanelli

arXiv:2505.18517v13.31 citations h-index: 31 INTERSPEECH

Originality Incremental advance

AI Analysis

This work addresses the problem of adapting LLMs to speech and audio tasks. It offers researchers and practitioners a simpler, more efficient approach that yields an incremental improvement in multitask learning.

The paper tackles the challenge of adapting large language models (LLMs) to general-purpose audio-language tasks by introducing LiSTEN, a framework that uses dynamic prompt selection with learnable key-value pairs, achieving competitive performance with fewer trainable parameters and reduced dependence on large datasets.

Foundation models based on large language models (LLMs) have shown great success in handling various tasks and modalities. However, adapting these models for general-purpose audio-language tasks is challenging due to differences in acoustic environments and task variations. In this work, we introduce LiSTEN (Learning Soft Token Embeddings for Neural Audio LLMs), a framework for adapting LLMs to speech and audio tasks. LiSTEN uses a dynamic prompt selection strategy with learnable key-value pairs, allowing the model to balance general and task-specific knowledge while avoiding overfitting in a multitask setting. Our approach reduces dependence on large-scale ASR or captioning datasets, achieves competitive performance with fewer trainable parameters, and simplifies training by using a single-stage process. Additionally, LiSTEN enhances interpretability by analyzing the diversity and overlap of selected prompts across different tasks.
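The abstract does not spell out how the dynamic prompt selection works, so the following is only a minimal sketch of one common way to select prompts from a pool of key-value pairs. It is not the authors' implementation. Everything in it is an assumption made for illustration: the cosine-similarity scoring, the top-k rule, the pool and embedding sizes, and all names (`select_prompts`, `keys`, `values`, `query`). In a real system the keys and values would be trained parameters and the query would come from an audio encoder; here they are random stand-ins.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def select_prompts(query, keys, values, top_k):
    """Return (indices, prompts) for the top_k keys most similar to the query."""
    scores = [cosine(query, key) for key in keys]
    ranked = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
    chosen = ranked[:top_k]
    return chosen, [values[i] for i in chosen]


# Hypothetical sizes, chosen only to keep the example small.
POOL_SIZE, KEY_DIM, PROMPT_LEN, EMBED_DIM, TOP_K = 8, 4, 2, 6, 3

rng = random.Random(0)
# Each pool entry pairs a key vector with a prompt (a short sequence of soft tokens).
keys = [[rng.gauss(0, 1) for _ in range(KEY_DIM)] for _ in range(POOL_SIZE)]
values = [
    [[rng.gauss(0, 1) for _ in range(EMBED_DIM)] for _ in range(PROMPT_LEN)]
    for _ in range(POOL_SIZE)
]

# Stand-in for a pooled audio feature vector (assumed, not from the paper).
query = [rng.gauss(0, 1) for _ in range(KEY_DIM)]

chosen, prompts = select_prompts(query, keys, values, TOP_K)

# Concatenate the selected prompts into one sequence of soft tokens that
# would be prepended to the LLM input embeddings.
soft_tokens = [token for prompt in prompts for token in prompt]
```

Logging `chosen` per task and comparing the resulting index sets is one simple way to look at the diversity and overlap of selected prompts that the abstract mentions.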
