CL · Apr 23

Preferences of a Voice-First Nation: Large-Scale Pairwise Evaluation and Preference Analysis for TTS in Indian Languages

Srija Anand, Ashwin Sankar, Ishvinder Sethi, Aaditya Pareek, Kartik Rajput, Gaurav Yadav, Nikhil Narasimhan, Adish Pandya, Deepon Halder, Mohammed Safi Ur Rahman Khan, Praveen S, Shobhit Banga

arXiv:2604.21481 · 37.2 · h-index: 41

Predicted impact: top 14% in CL · last 90 days

Originality: Incremental advance

AI Analysis

This work provides a scalable, linguistically controlled evaluation methodology for multilingual TTS, addressing the high variance in crowdsourced assessments for diverse languages.

The authors developed a controlled multidimensional pairwise evaluation framework for multilingual TTS, collecting over 120K pairwise comparisons across 10 Indic languages to construct a leaderboard and analyze perceptual trade-offs.

Crowdsourced pairwise evaluation has emerged as a scalable approach for assessing foundation models. However, applying it to Text-to-Speech (TTS) introduces high variance due to linguistic diversity and the multidimensional nature of speech perception. We present a controlled multidimensional pairwise evaluation framework for multilingual TTS that combines linguistic control with perceptually grounded annotation. Using 5K+ native and code-mixed sentences across 10 Indic languages, we evaluate 7 state-of-the-art TTS systems and collect over 120K pairwise comparisons from over 1900 native raters. In addition to overall preference, raters provide judgments across 6 perceptual dimensions: intelligibility, expressiveness, voice quality, liveliness, noise, and hallucinations. Using Bradley-Terry modeling, we construct a multilingual leaderboard, interpret human preferences using SHAP analysis, and analyze leaderboard reliability alongside model strengths and trade-offs across perceptual dimensions.
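To illustrate how a leaderboard can be derived from pairwise votes, here is a minimal sketch of Bradley-Terry fitting using the standard minorization-maximization (MM) update. It is not the authors' implementation: the function names `fit_bradley_terry` and `win_probability`, the system labels, and the toy vote counts are all hypothetical, and the sketch ignores ties, rater effects, and per-dimension judgments.

```python
from collections import defaultdict


def fit_bradley_terry(comparisons, iters=500, tol=1e-10):
    """Fit Bradley-Terry strengths from (winner, loser) pairs via MM updates.

    Returns a dict mapping each item to a strength; strengths sum to 1.
    Assumption: no ties, and every comparison has two distinct items.
    """
    wins = defaultdict(float)          # total wins per item
    pair_counts = defaultdict(float)   # number of comparisons per unordered pair
    items = set()
    for winner, loser in comparisons:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1
        items.update((winner, loser))
    items = sorted(items)

    # Start from uniform strengths.
    strength = {i: 1.0 / len(items) for i in items}
    for _ in range(iters):
        updated = {}
        for i in items:
            # MM update: p_i <- W_i / sum_j n_ij / (p_i + p_j)
            denom = sum(
                pair_counts[frozenset((i, j))] / (strength[i] + strength[j])
                for j in items
                if j != i and frozenset((i, j)) in pair_counts
            )
            updated[i] = wins[i] / denom if denom > 0 else strength[i]
        total = sum(updated.values())
        updated = {i: v / total for i, v in updated.items()}
        delta = max(abs(updated[i] - strength[i]) for i in items)
        strength = updated
        if delta < tol:
            break
    return strength


def win_probability(strength, a, b):
    """Model probability that item a is preferred over item b."""
    return strength[a] / (strength[a] + strength[b])


# Hypothetical toy votes for three made-up systems (not data from the paper).
votes = (
    [("sys_A", "sys_B")] * 8 + [("sys_B", "sys_A")] * 2
    + [("sys_B", "sys_C")] * 7 + [("sys_C", "sys_B")] * 3
    + [("sys_A", "sys_C")] * 9 + [("sys_C", "sys_A")] * 1
)
scores = fit_bradley_terry(votes)
leaderboard = sorted(scores, key=scores.get, reverse=True)
print(leaderboard)
```

Sorting by fitted strength gives the ranking, and `win_probability` turns any two strengths into a predicted preference rate. A system with zero wins would be driven to a strength of zero under this update, so a production fit would typically add regularization or a prior.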
