DDOS: A MOS Prediction Framework utilizing Domain Adaptive Pre-training and Distribution of Opinion Scores
This work addresses the time-consuming collection of subjective MOS for speech synthesis systems, offering an incremental improvement in automatic evaluation methods.
The authors tackle the problem of automatically predicting mean opinion scores (MOS) for speech synthesis evaluation. They propose DDOS, a model that applies domain adaptive pre-training on synthetic speech and models the opinion score distribution of each utterance. DDOS outperformed previous works on the BVCC dataset and achieved second place in the Interspeech 2022 VoiceMOS challenge in terms of system-level score.
Mean opinion score (MOS) is a typical subjective evaluation metric for speech synthesis systems. Since collecting MOS is time-consuming, accurate MOS prediction models for automatic evaluation are desirable. In this work, we propose DDOS, a novel MOS prediction model. DDOS uses domain adaptive pre-training to further pre-train self-supervised learning models on synthetic speech, and adds a module that models the opinion score distribution of each utterance. With the proposed components, DDOS outperforms previous works on the BVCC dataset, and its zero-shot transfer result on the BC2019 dataset is significantly improved. DDOS also won second place in the Interspeech 2022 VoiceMOS challenge in terms of system-level score.
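
To make the distinction between an opinion score distribution, an utterance-level MOS, and a system-level score concrete, the following is a minimal sketch and not the DDOS implementation. It assumes a 5-point opinion scale, assumes the model emits one logit per score, and assumes the utterance-level MOS is read off as the expectation of the predicted distribution; the abstract does not specify any of these details. The function names `expected_mos` and `system_level_mos` are hypothetical.

```python
import math
from collections import defaultdict

# Assumption: listeners rate on a 5-point scale (1 = bad, 5 = excellent).
SCORES = [1, 2, 3, 4, 5]


def softmax(logits):
    """Turn raw logits into a probability distribution over SCORES."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_mos(logits):
    """Utterance-level MOS as the mean of the predicted score distribution.

    Assumption: one logit per opinion score; the real model head may differ.
    """
    probs = softmax(logits)
    return sum(score * p for score, p in zip(SCORES, probs))


def system_level_mos(utterance_preds):
    """Average utterance-level predictions for each synthesis system.

    utterance_preds: iterable of (system_id, predicted_mos) pairs.
    """
    by_system = defaultdict(list)
    for system_id, mos in utterance_preds:
        by_system[system_id].append(mos)
    return {sid: sum(vals) / len(vals) for sid, vals in by_system.items()}
```

For example, `expected_mos` applied to five equal logits gives the midpoint of the scale, and `system_level_mos` collapses several utterance-level predictions from the same system into one number per system, which is the granularity at which the challenge ranking mentioned above is reported.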