SD AS · Mar 15

CodecMOS-Accent: A MOS Benchmark of Resynthesized and TTS Speech from Neural Codecs Across English Accents

Wen-Chin Huang, Nicholas Sanders, Erica Cooper

arXiv:2603.14328 · 67.9 · h-index: 4

AI Analysis

This work provides a benchmark for more human-centric evaluation of speech synthesis, with a focus on accented speech, though the contribution is incremental because it builds on existing MOS datasets.

The researchers address the evaluation of neural audio codec and TTS models, particularly on accented speech, by building a benchmark dataset of 4,000 samples and 19,600 annotations. Their analysis reveals, among other findings, a perceptual bias when listeners share the speaker's accent.

We present the CodecMOS-Accent dataset, a mean opinion score (MOS) benchmark designed to evaluate neural audio codec (NAC) models and the large language model (LLM)-based text-to-speech (TTS) models trained upon them, especially on non-standard speech such as accented speech. The dataset comprises 4,000 codec resynthesis and TTS samples from 24 systems, featuring 32 speakers spanning ten accents. A large-scale subjective test was conducted to collect 19,600 annotations from 25 listeners across three dimensions: naturalness, speaker similarity, and accent similarity. This dataset not only represents an up-to-date study of recent speech synthesis system performance but also reveals insights including a tight relationship between speaker and accent similarity, the predictive power of objective metrics, and a perceptual bias when listeners share the speaker's accent. This dataset is expected to foster research on more human-centric evaluation for NAC and accented TTS.
