Audio-Maestro: Enhancing Large Audio-Language Models with Tool-Augmented Reasoning
This work addresses the need for more accurate and interpretable audio understanding in AI systems, although the contribution is incremental because it builds on existing multimodal models.
The paper tackles the limited interpretability and accuracy of large audio-language models by introducing Audio-Maestro, a tool-augmented reasoning framework that improves general audio reasoning performance. Average accuracy on MMAU-Test rises from 67.4% to 72.1% for Gemini-2.5-flash (+4.7 points), from 58.3% to 62.8% for DeSTA-2.5 (+4.5 points), and from 60.8% to 63.9% for GPT-4o (+3.1 points).
Recent advancements in large multimodal models (LMMs) have shown strong capabilities in audio understanding. However, most systems rely solely on end-to-end reasoning, limiting interpretability and accuracy for tasks that require structured knowledge or specialized signal analysis. In this work, we present Audio-Maestro, a tool-augmented audio reasoning framework that enables audio-language models to autonomously call external tools and integrate their timestamped outputs into the reasoning process. This design allows the model to analyze, transform, and interpret audio signals through specialized tools rather than relying solely on end-to-end inference. Experiments show that Audio-Maestro consistently improves general audio reasoning performance: Gemini-2.5-flash's average accuracy on MMAU-Test rises from 67.4% to 72.1%, DeSTA-2.5's from 58.3% to 62.8%, and GPT-4o's from 60.8% to 63.9%. To our knowledge, Audio-Maestro is the first framework to integrate structured tool output into the reasoning process of large audio-language models.
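
To make the idea of integrating timestamped tool outputs into the reasoning process concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the tool names, the `ToolSegment` schema, the keyword-based `choose_tools` rule (a stand-in for the model's own tool selection), and the prompt layout are all hypothetical, and the tools return canned data instead of analyzing real audio.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ToolSegment:
    """One timestamped tool result (hypothetical schema, not from the paper)."""
    start: float  # segment start, in seconds
    end: float    # segment end, in seconds
    label: str    # what the tool found in this span


# Hypothetical stand-in tools returning canned data. A real system would
# wrap actual analyzers (for example speech recognition or event detection).
def fake_transcriber(audio_path: str) -> List[ToolSegment]:
    return [ToolSegment(0.0, 2.5, "hello there"), ToolSegment(2.5, 4.0, "goodbye")]


def fake_event_detector(audio_path: str) -> List[ToolSegment]:
    return [ToolSegment(1.0, 1.8, "door slam")]


TOOLS: Dict[str, Callable[[str], List[ToolSegment]]] = {
    "transcribe": fake_transcriber,
    "detect_events": fake_event_detector,
}


def choose_tools(question: str) -> List[str]:
    """Stand-in for the model's tool-selection step (simple keyword rule)."""
    q = question.lower()
    chosen = []
    if any(word in q for word in ("say", "said", "speech")):
        chosen.append("transcribe")
    if any(word in q for word in ("sound", "event")):
        chosen.append("detect_events")
    return chosen


def format_tool_output(name: str, segments: List[ToolSegment]) -> str:
    """Render one tool's segments as timestamped text lines."""
    lines = [f"[{name}]"]
    lines += [f"{s.start:.2f}-{s.end:.2f}s: {s.label}" for s in segments]
    return "\n".join(lines)


def build_augmented_prompt(question: str, audio_path: str) -> str:
    """Run the selected tools and prepend their outputs to the question."""
    blocks = [format_tool_output(n, TOOLS[n](audio_path)) for n in choose_tools(question)]
    return "\n\n".join(blocks + [f"Question: {question}"])


if __name__ == "__main__":
    print(build_augmented_prompt("What was said before the sound?", "clip.wav"))
```

In this sketch the final prompt would be passed to the audio-language model together with the audio, so the answer can cite explicit time spans instead of relying only on end-to-end inference. When no tool is selected, the prompt reduces to the bare question, which corresponds to the end-to-end baseline.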