CL · Apr 23

Preferences of a Voice-First Nation: Large-Scale Pairwise Evaluation and Preference Analysis for TTS in Indian Languages

arXiv:2604.21481 · 37.2 · h-index: 41
Predicted impact top 14% in CL · last 90 days
Originality Incremental advance
AI Analysis

This work provides a scalable, linguistically controlled evaluation methodology for multilingual TTS, addressing the high variance in crowdsourced assessments for diverse languages.

The authors developed a controlled multidimensional pairwise evaluation framework for multilingual TTS, collecting over 120K pairwise comparisons across 10 Indic languages to construct a leaderboard and analyze perceptual trade-offs.

Crowdsourced pairwise evaluation has emerged as a scalable approach for assessing foundation models. However, applying it to Text-to-Speech (TTS) introduces high variance due to linguistic diversity and the multidimensional nature of speech perception. We present a controlled multidimensional pairwise evaluation framework for multilingual TTS that combines linguistic control with perceptually grounded annotation. Using 5K+ native and code-mixed sentences across 10 Indic languages, we evaluate 7 state-of-the-art TTS systems and collect over 120K pairwise comparisons from over 1900 native raters. In addition to overall preference, raters provide judgments across 6 perceptual dimensions: intelligibility, expressiveness, voice quality, liveliness, noise, and hallucinations. Using Bradley-Terry modeling, we construct a multilingual leaderboard, interpret human preferences using SHAP analysis, and analyze leaderboard reliability alongside model strengths and trade-offs across perceptual dimensions.
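To make the leaderboard step concrete, here is a minimal sketch of how Bradley-Terry strengths can be estimated from pairwise outcomes. It is not the authors' implementation: the function names `fit_bradley_terry` and `win_probability`, the system labels, and the toy counts are all hypothetical, and the fit uses the standard minorization-maximization update rather than whatever solver the paper used. It also assumes ties have already been dropped or resolved.

```python
from collections import defaultdict


def fit_bradley_terry(comparisons, n_iter=500, tol=1e-10):
    """Estimate Bradley-Terry strengths from (winner, loser) pairs.

    Hypothetical helper for illustration; uses the classic MM update
    p_i <- wins_i / sum_j n_ij / (p_i + p_j).
    """
    wins = defaultdict(float)         # total wins per system
    pair_counts = defaultdict(float)  # comparisons per unordered pair
    systems = set()
    for winner, loser in comparisons:
        wins[winner] += 1.0
        pair_counts[frozenset((winner, loser))] += 1.0
        systems.update((winner, loser))

    systems = sorted(systems)
    if not systems:
        return {}

    strength = {s: 1.0 for s in systems}
    for _ in range(n_iter):
        updated = {}
        for i in systems:
            denom = 0.0
            for j in systems:
                if i == j:
                    continue
                n_ij = pair_counts.get(frozenset((i, j)), 0.0)
                if n_ij:
                    denom += n_ij / (strength[i] + strength[j])
            updated[i] = wins[i] / denom if denom else strength[i]

        # Strengths are only identified up to scale, so fix the mean at 1.
        total = sum(updated.values())
        updated = {s: v * len(systems) / total for s, v in updated.items()}

        delta = max(abs(updated[s] - strength[s]) for s in systems)
        strength = updated
        if delta < tol:
            break
    return strength


def win_probability(strength, a, b):
    """Model probability that system a is preferred over system b."""
    return strength[a] / (strength[a] + strength[b])


# Toy data (made up): each tuple is (preferred system, other system).
toy = (
    [("sys_A", "sys_B")] * 8 + [("sys_B", "sys_A")] * 2
    + [("sys_B", "sys_C")] * 7 + [("sys_C", "sys_B")] * 3
    + [("sys_A", "sys_C")] * 9 + [("sys_C", "sys_A")] * 1
)
scores = fit_bradley_terry(toy)
leaderboard = sorted(scores, key=scores.get, reverse=True)
```

Sorting systems by fitted strength gives the ranking, and `win_probability` converts any two strengths into a predicted preference rate. Fitting the same model separately on the judgments for each perceptual dimension would yield one ranking per dimension, which is one plausible way to surface the trade-offs the abstract mentions.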

