CL, LG, ML | Nov 28, 2016

AutoMOS: Learning a non-intrusive assessor of naturalness-of-speech

arXiv:1611.09207v1
99 citations
Originality: Incremental advance
AI Analysis

AutoMOS gives developers of text-to-speech synthesizers a non-intrusive tool for assessing speech naturalness without human raters. The advance is incremental, since it builds on existing neural network methods.

The paper automates the assessment of speech synthesis quality by modeling human raters' mean opinion scores (MOS) with a deep recurrent neural network whose only input is a raw waveform. Its utterance-level MOS estimates are only moderately inferior to sampled human ratings, as measured by Pearson and Spearman correlations, and its correlations approach those of human raters when scores are averaged over multiple utterances.

Developers of text-to-speech synthesizers (TTS) often make use of human raters to assess the quality of synthesized speech. We demonstrate that we can model human raters' mean opinion scores (MOS) of synthesized speech using a deep recurrent neural network whose inputs consist solely of a raw waveform. Our best models provide utterance-level estimates of MOS only moderately inferior to sampled human ratings, as shown by Pearson and Spearman correlations. When multiple utterances are scored and averaged, a scenario common in synthesizer quality assessment, AutoMOS achieves correlations approaching those of human raters. The AutoMOS model has a number of applications, such as the ability to explore the parameter space of a speech synthesizer without requiring a human-in-the-loop.
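
The abstract reports two evaluation settings: correlation at the level of single utterances, and correlation after scores for multiple utterances are averaged. The following is a minimal sketch of how those two comparisons could be computed, not the authors' evaluation code. The grouping by synthesizer system, the system labels, the helper names, and all scores are hypothetical and chosen only for illustration.

```python
import math
from collections import defaultdict


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def group_means(keys, values):
    """Average the values that share a key (here: one key per system)."""
    groups = defaultdict(list)
    for key, value in zip(keys, values):
        groups[key].append(value)
    return {key: sum(vals) / len(vals) for key, vals in groups.items()}


# Hypothetical data: three synthesizer systems, three utterances each.
systems = ["A", "A", "A", "B", "B", "B", "C", "C", "C"]
human_mos = [2.1, 2.5, 2.3, 3.4, 3.0, 3.6, 4.2, 4.4, 4.0]
predicted = [2.6, 2.2, 2.4, 3.1, 3.5, 3.3, 4.1, 3.9, 4.5]

# Setting 1: utterance-level correlation.
print("utterance Pearson: ", pearson(human_mos, predicted))
print("utterance Spearman:", spearman(human_mos, predicted))

# Setting 2: average over the utterances of each system, then correlate.
human_by_system = group_means(systems, human_mos)
predicted_by_system = group_means(systems, predicted)
names = sorted(human_by_system)
human_avg = [human_by_system[s] for s in names]
predicted_avg = [predicted_by_system[s] for s in names]
print("averaged Pearson:  ", pearson(human_avg, predicted_avg))
print("averaged Spearman: ", spearman(human_avg, predicted_avg))
```

Averaging before correlating cancels part of the per-utterance noise in both the human ratings and the predictions, which is one plausible reason the averaged setting yields higher agreement.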

