SD · AS · Apr 16

SpeechLLM-as-Judges: Towards General and Interpretable Speech Quality Evaluation

arXiv:2510.14664 · 77.813 citations · h-index: 13 · Has Code
Predicted impact: top 16% in SD · last 90 days · Originality: Highly original
AI Analysis

This work addresses the need for interpretable and generalizable speech quality evaluation, offering a paradigm shift from scalar scores to structured explanations.

The paper introduces SpeechLLM-as-Judges, a new paradigm using LLMs for interpretable speech quality evaluation, and develops SQ-LLM which achieves strong performance across four tasks and multiple languages.

Generative speech technologies are progressing rapidly, but evaluating the perceptual quality of synthetic speech remains a core challenge. Existing methods typically rely on scalar scores or binary decisions, which lack interpretability and generalization across tasks and languages. We present SpeechLLM-as-Judges, a new paradigm for enabling large language models (LLMs) to conduct structured and explanation-based speech quality evaluation. To support this direction, we introduce SpeechEval, a large-scale dataset containing 32,207 multilingual speech clips and 128,754 annotations spanning four tasks: quality assessment, pairwise comparison, improvement suggestion, and deepfake detection. Based on this resource, we develop SQ-LLM, a speech-quality-aware LLM trained with chain-of-thought reasoning and reward optimization to improve capability. Experimental results show that SQ-LLM delivers strong performance across tasks and languages, revealing the potential of this paradigm for advancing speech quality evaluation. The relevant code, models, and data are publicly available at https://github.com/NKU-HLT/SpeechLLM-as-Judges.
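To make the idea of structured, explanation-based judgments concrete, here is a minimal Python sketch of what one evaluation record might look like. Only the four task names come from the abstract; the class names, field names, and the validity rule are hypothetical illustrations and are not taken from the SpeechEval data format or the released code.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Task(Enum):
    """The four tasks named in the abstract."""
    QUALITY_ASSESSMENT = "quality assessment"
    PAIRWISE_COMPARISON = "pairwise comparison"
    IMPROVEMENT_SUGGESTION = "improvement suggestion"
    DEEPFAKE_DETECTION = "deepfake detection"


@dataclass
class Judgment:
    """Hypothetical record: a free-text explanation instead of a bare scalar score."""
    task: Task
    clip_path: str
    explanation: str
    # Assumption: only pairwise comparison needs a second clip.
    reference_clip_path: Optional[str] = None

    def is_valid(self) -> bool:
        # A record needs a non-empty explanation, and a second clip
        # exactly when the task is pairwise comparison.
        needs_pair = self.task is Task.PAIRWISE_COMPARISON
        has_pair = self.reference_clip_path is not None
        return bool(self.explanation.strip()) and needs_pair == has_pair
```

A record such as `Judgment(Task.QUALITY_ASSESSMENT, "a.wav", "Slight robotic timbre.")` passes the check, while a pairwise comparison without a second clip does not.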

