AS, AI, SD, Mar 28, 2025

Make Some Noise: Towards LLM audio reasoning and generation using sound tokens

arXiv:2503.22275v1, 1 citation, h-index: 8, ICASSP
Originality: Incremental advance
AI Analysis

This work addresses the problem of enabling multimodal audio-text capabilities in LLMs for applications in AI and audio processing, but it is incremental as it builds on existing tokenization and adaptation techniques.

The paper tackles the challenge of integrating audio comprehension and generation into large language models by converting audio into ultra-low bitrate discrete tokens at 0.23 kbps, so that audio tokens can be mixed with text tokens in a single model. In audio comprehension the approach achieves results competitive with state-of-the-art methods, though audio generation performance is poor.
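To give a feel for what 0.23 kbps means, the short sketch below relates bitrate, token rate, and codebook size for a single-codebook token stream. The codebook size of 1024 and the 16 kHz, 16-bit mono PCM baseline are illustrative assumptions, not values taken from the paper; the function names are hypothetical as well.

```python
import math


def token_bitrate_bps(tokens_per_second: float, codebook_size: int) -> float:
    """Bitrate in bits/s of a single-codebook discrete token stream."""
    return tokens_per_second * math.log2(codebook_size)


def tokens_per_second_for(bitrate_bps: float, codebook_size: int) -> float:
    """Token rate needed to reach a given bitrate with a given codebook."""
    return bitrate_bps / math.log2(codebook_size)


TARGET_BPS = 230.0            # 0.23 kbps, the figure quoted in the text
HYPOTHETICAL_CODEBOOK = 1024  # assumption: 10 bits per token
PCM_BPS = 16_000 * 16         # assumption: 16 kHz, 16-bit mono PCM baseline

rate = tokens_per_second_for(TARGET_BPS, HYPOTHETICAL_CODEBOOK)
compression = PCM_BPS / TARGET_BPS

print(f"tokens per second at 0.23 kbps: {rate:.1f}")
print(f"compression vs. assumed PCM baseline: {compression:.0f}x")
```

Under these assumptions the stream carries only a few dozen tokens per second, which is why audio sequences become short enough to interleave with text, and also why fine-grained detail is lost.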

Integrating audio comprehension and generation into large language models (LLMs) remains challenging due to the continuous nature of audio and the resulting high sampling rates. Here, we introduce a novel approach that combines Variational Quantization with Conditional Flow Matching to convert audio into ultra-low bitrate discrete tokens of 0.23 kbps, allowing for seamless integration with text tokens in LLMs. We fine-tuned a pretrained text-based LLM using Low-Rank Adaptation (LoRA) to assess its effectiveness in achieving true multimodal capabilities, i.e., audio comprehension and generation. Our tokenizer outperforms a traditional VQ-VAE across various datasets with diverse acoustic events. Despite the substantial loss of fine-grained details through audio tokenization, our multimodal LLM trained with discrete tokens achieves audio comprehension results competitive with state-of-the-art methods, though audio generation is poor. Our results highlight the need for larger, more diverse datasets and improved evaluation metrics to advance multimodal LLM performance.
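As background on the LoRA fine-tuning mentioned in the abstract, here is a minimal pure-Python sketch of the general low-rank update, in which a frozen weight W is augmented by a scaled product B·A. It illustrates the generic technique only, not the authors' training setup; the dimensions, rank, scale, and helper names are all hypothetical.

```python
import random


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(W, A, B, x, alpha, r):
    """Compute W x + (alpha / r) * B (A x); W stays frozen, A and B are trained."""
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * d for b, d in zip(base, delta)]


def lora_trainable_params(d_out, d_in, r):
    """Parameters in A (r x d_in) plus B (d_out x r)."""
    return r * (d_in + d_out)


random.seed(0)
d_out, d_in, r, alpha = 3, 4, 2, 4.0  # toy sizes, chosen for illustration
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]
B = [[0.0] * r for _ in range(d_out)]  # zero init: adapter starts as a no-op
x = [1.0, -2.0, 0.5, 3.0]

y_base = matvec(W, x)
y_lora = lora_forward(W, A, B, x, alpha, r)
print("adapter is a no-op at init:", y_lora == y_base)

# Hypothetical 4096 x 4096 projection with rank 8.
full = 4096 * 4096
low_rank = lora_trainable_params(4096, 4096, 8)
print(f"trainable fraction: {low_rank / full:.4%}")
```

Because B starts at zero, the adapted model initially reproduces the pretrained text LLM exactly, and only the small A and B matrices need gradients when audio tokens are added to the training data.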

