AI · LG · SD · AS · May 24, 2025

LiSTEN: Learning Soft Token Embeddings for Neural Audio LLMs

arXiv:2505.18517v1 · 1 citation · h-index: 31 · INTERSPEECH
Originality: Incremental advance
AI Analysis

This work addresses the problem of adapting LLMs to speech and audio tasks, offering researchers and practitioners a simpler and more efficient approach. Its contribution to multitask learning is incremental.

The paper tackles the challenge of adapting large language models (LLMs) to general-purpose audio-language tasks by introducing LiSTEN, a framework that uses dynamic prompt selection with learnable key-value pairs. It achieves competitive performance with fewer trainable parameters and reduced dependence on large datasets.

Foundation models based on large language models (LLMs) have shown great success in handling various tasks and modalities. However, adapting these models for general-purpose audio-language tasks is challenging due to differences in acoustic environments and task variations. In this work, we introduce LiSTEN (Learning Soft Token Embeddings for Neural Audio LLMs), a framework for adapting LLMs to speech and audio tasks. LiSTEN uses a dynamic prompt selection strategy with learnable key-value pairs, allowing the model to balance general and task-specific knowledge while avoiding overfitting in a multitask setting. Our approach reduces dependence on large-scale ASR or captioning datasets, achieves competitive performance with fewer trainable parameters, and simplifies training by using a single-stage process. Additionally, LiSTEN enhances interpretability by analyzing the diversity and overlap of selected prompts across different tasks.
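The abstract does not spell out how dynamic prompt selection with key-value pairs works, so the following is only a minimal sketch of one common way such a prompt pool can be built. Everything specific here is an assumption rather than the authors' method: the mean-pooled query, the cosine similarity score, the top-k rule, and the hypothetical names `mean_pool`, `select_prompts`, and `build_llm_input`. In a real system the keys and values would be trainable parameters; here they are plain lists so the selection logic is easy to follow.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mean_pool(frames):
    """Average a list of frame vectors into a single query vector (assumed pooling)."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]


def select_prompts(query, keys, values, top_k):
    """Pick the top_k pool entries whose keys best match the query.

    keys[i] is a vector; values[i] is a list of soft prompt token vectors.
    Returns the chosen indices and the concatenated prompt tokens.
    """
    ranked = sorted(
        range(len(keys)), key=lambda i: cosine(query, keys[i]), reverse=True
    )
    chosen = ranked[:top_k]
    prompt_tokens = [tok for i in chosen for tok in values[i]]
    return chosen, prompt_tokens


def build_llm_input(audio_frames, keys, values, top_k=2):
    """Prepend the selected soft prompts to the audio embeddings (assumed layout)."""
    query = mean_pool(audio_frames)
    chosen, prompt_tokens = select_prompts(query, keys, values, top_k)
    return chosen, prompt_tokens + audio_frames


if __name__ == "__main__":
    rng = random.Random(0)
    dim, pool_size, prompt_len = 4, 5, 3  # illustrative sizes only
    keys = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(pool_size)]
    values = [
        [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(prompt_len)]
        for _ in range(pool_size)
    ]
    audio = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(6)]
    chosen, seq = build_llm_input(audio, keys, values, top_k=2)
    print("selected pool indices:", chosen)
    print("sequence length:", len(seq))
```

Logging the `chosen` indices per task is also one plausible way to study the diversity and overlap of selected prompts that the abstract mentions, though the paper's actual analysis may differ.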

