SD · CL · AS · May 4, 2025

Probing Audio-Generation Capabilities of Text-Based Language Models

arXiv:2506.00003v1 · 1 citation · h-index: 5
Originality Synthesis-oriented
AI Analysis

This work addresses the problem of audio generation from text, a topic of interest to AI researchers, but it is incremental: it builds on existing LLM capabilities without introducing a new method.

The study investigated whether text-based large language models (LLMs) can generate audio by prompting them to produce code that creates musical notes, environmental sounds, and human speech, finding that performance deteriorates with increasing audio complexity, as measured by FAD and CLAP scores.

How does the textual representation of audio relate to what large language models (LLMs) learn about the audio world? This research investigates the extent to which LLMs can be prompted to generate audio, despite being trained primarily on textual data. We employ a three-tier approach, progressively increasing the complexity of audio generation: 1) Musical Notes, 2) Environmental Sounds, and 3) Human Speech. To bridge the gap between text and audio, we use code as an intermediary, prompting LLMs to generate code that, when executed, produces the desired audio output. To evaluate the quality and accuracy of the generated audio, we employ Fréchet Audio Distance (FAD) and CLAP scores. Our findings reveal that while LLMs can generate basic audio features, their performance deteriorates as the complexity of the audio increases. This suggests that while LLMs possess a latent understanding of the auditory world, their ability to translate this understanding into tangible audio output remains rudimentary. Further research into techniques that enhance the quality and diversity of LLM-generated audio could improve the performance of text-based LLMs in generating audio.
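To make the "code as an intermediary" idea concrete, here is a minimal sketch of the kind of program an LLM might be prompted to write for the simplest tier (a single musical note). It is an illustration only, not the paper's prompts or outputs: the function names `synthesize_note` and `write_wav`, the 16 kHz sample rate, the amplitude, and the output filename are all assumptions chosen for this example.

```python
import math
import struct
import wave


def synthesize_note(freq_hz, duration_s, sample_rate=16000, amplitude=0.5):
    """Return 16-bit PCM samples of a pure sine tone (hypothetical helper)."""
    n_samples = int(duration_s * sample_rate)
    peak = int(amplitude * 32767)  # scale to the signed 16-bit range
    return [
        int(peak * math.sin(2 * math.pi * freq_hz * i / sample_rate))
        for i in range(n_samples)
    ]


def write_wav(path, samples, sample_rate=16000):
    """Write mono 16-bit PCM samples to a WAV file (hypothetical helper)."""
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes per sample = 16-bit
        wav_file.setframerate(sample_rate)
        # Pack all samples as little-endian signed shorts.
        wav_file.writeframes(struct.pack("<%dh" % len(samples), *samples))


if __name__ == "__main__":
    # A4 (440 Hz) for one second; the filename is an arbitrary choice.
    write_wav("a4.wav", synthesize_note(440.0, 1.0))
```

A pure sine tone like this is fully specified by a frequency and a duration, which may be one reason the musical-note tier is the most tractable; environmental sounds and speech have no comparably compact closed-form description.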

