MOS-Bench: Benchmarking Generalization Abilities of Subjective Speech Quality Assessment Models
For researchers developing speech quality assessment models, this work highlights the overlooked problem of out-of-domain generalization and provides a benchmark to evaluate it.
This paper introduces MOS-Bench, a collection of 8 training sets and 17 test sets for subjective speech quality assessment, and systematically evaluates out-of-domain generalization. The authors find that pooling multiple training sets is a simple yet effective solution, and that data variation matters more than data size for robust generalization.
In this paper, we study the task of subjective speech quality assessment (SSQA), which refers to predicting the perceptual quality of speech. Owing to the development of deep neural network models, SSQA has greatly advanced and has been widely applied in scientific papers to evaluate speech generation systems. Nonetheless, the limited out-of-domain (OOD) generalization ability of current SSQA models remains underexplored and is often overlooked by researchers. To study this problem systematically, we present MOS-Bench, a diverse SSQA dataset collection that currently contains 8 training sets and 17 test sets. Through extensive experiments, we first highlight the OOD generalization challenges of existing models. We then evaluate the efficacy of multiple-dataset training, comparing straightforward data pooling against AlignNet, an existing domain-aware method. We demonstrate that pooling multiple training sets provides a simple yet effective solution, and that variation in the data, rather than training data size alone, is a key factor for robust generalization.
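As a rough illustration of what straightforward data pooling could look like in practice, the sketch below concatenates several training sets into a single shuffled list of samples. It is not the authors' implementation: the `Sample` record, the `pool_training_sets` helper, the dataset names, and the file paths are all hypothetical, and the sketch assumes every set uses the same MOS scale, which the abstract does not state.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    wav_path: str  # path to an audio file (hypothetical)
    mos: float     # mean opinion score label
    source: str    # name of the training set the sample came from


def pool_training_sets(training_sets, seed=0):
    """Concatenate several training sets into one shuffled list.

    `training_sets` maps a dataset name to a list of (wav_path, mos) pairs.
    Assumption: all sets share one MOS scale, so labels are used unchanged.
    """
    pooled = [
        Sample(wav_path, mos, name)
        for name, samples in training_sets.items()
        for wav_path, mos in samples
    ]
    # A seeded shuffle keeps the pooled order reproducible across runs.
    random.Random(seed).shuffle(pooled)
    return pooled


# Toy data standing in for real training sets.
toy_sets = {
    "set_a": [("a/001.wav", 3.5), ("a/002.wav", 4.1)],
    "set_b": [("b/001.wav", 2.0)],
}
pooled = pool_training_sets(toy_sets)
```

Keeping the `source` field on each sample is a design choice of this sketch: plain pooling ignores it during training, while a domain-aware method could use it as an additional input.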