AQAScore: Evaluating Semantic Alignment in Text-to-Audio Generation via Audio Question Answering

Chun-Yi Kuan, Kai-Wei Chang, Hung-yi Lee

arXiv:2601.14728v15.15 citations h-index: 12

Originality: Incremental advance

AI Analysis

AQAScore addresses the need for better evaluation metrics in text-to-audio generation, offering researchers and practitioners a more fine-grained and scalable approach. The advance is incremental, since it builds on the capabilities of existing large language models.

The paper tackles the evaluation of semantic alignment in text-to-audio generation. It introduces AQAScore, a framework that uses audio-aware large language models to assess alignment via probabilistic semantic verification, and reports higher correlation with human judgments than existing metrics.

Although text-to-audio generation has made remarkable progress in realism and diversity, the development of evaluation metrics has not kept pace. Widely adopted approaches, typically based on embedding similarity, such as CLAPScore, effectively measure general relevance but remain limited in fine-grained semantic alignment and compositional reasoning. To address this, we introduce AQAScore, a backbone-agnostic evaluation framework that leverages the reasoning capabilities of audio-aware large language models (ALLMs). AQAScore reformulates assessment as a probabilistic semantic verification task: rather than relying on open-ended text generation, it estimates alignment by computing the exact log-probability of a "Yes" answer to targeted semantic queries. We evaluate AQAScore across multiple benchmarks, including human-rated relevance, pairwise comparison, and compositional reasoning tasks. Experimental results show that AQAScore consistently achieves higher correlation with human judgments than similarity-based metrics and generative prompting baselines, demonstrating its effectiveness in capturing subtle semantic inconsistencies and its ability to scale with the capability of the underlying ALLMs.
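
The scoring step described in the abstract can be illustrated with a minimal sketch. It assumes an ALLM has already produced next-token logits for a yes/no query about an audio clip; the score is then the log-softmax value at the "Yes" token. The prompt template, the function names, and the toy three-token vocabulary below are hypothetical and are not taken from the paper, which may handle prompts, tokenization, and normalization differently.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of raw logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def build_query(caption):
    """Hypothetical yes/no prompt; the paper's actual wording is not given here."""
    return f'Does the audio match the description "{caption}"? Answer Yes or No.'


def yes_log_probability(next_token_logits, yes_token_id):
    """Log-probability that the next token is "Yes" under the model's distribution.

    next_token_logits: logits over the vocabulary at the answer position
                       (assumed to come from an audio-aware LLM).
    yes_token_id:      vocabulary index of the "Yes" token (assumed known).
    """
    return log_softmax(next_token_logits)[yes_token_id]


# Toy example with a made-up vocabulary: 0 = "Yes", 1 = "No", 2 = other.
toy_logits = [2.0, 0.5, -1.0]
score = yes_log_probability(toy_logits, yes_token_id=0)
probability = math.exp(score)  # same score expressed as a probability in (0, 1)
```

Reading the score directly from the answer distribution gives a continuous value without sampling any text, which is the contrast with open-ended generation that the abstract draws.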
