CL, SD, AS | Jan 17, 2023

MooseNet: A Trainable Metric for Synthesized Speech with a PLDA Module

arXiv:2301.07087v2 | 2 citations | h-index: 30
Originality: Incremental advance
AI Analysis

This work addresses the need for accurate and efficient speech quality assessment, particularly in low-resource scenarios. Its contribution is incremental: it improves existing neural MOS prediction models rather than replacing them.

The authors tackle the problem of predicting listener Mean Opinion Scores (MOS) for synthesized speech. They introduce MooseNet, a trainable metric that places a Probabilistic Linear Discriminant Analysis (PLDA) module on top of embeddings from a self-supervised learning neural network. The fine-tuned network combined with PLDA gives their best results and surpasses the SSL baseline on the VoiceMOS Challenge data.

We present MooseNet, a trainable speech metric that predicts the listeners' Mean Opinion Score (MOS). We propose a novel approach where the Probabilistic Linear Discriminant Analysis (PLDA) generative model is used on top of an embedding obtained from a self-supervised learning (SSL) neural network (NN) model. We show that PLDA works well with a non-finetuned SSL model when trained on only 136 utterances (ca. one minute training time) and that PLDA consistently improves various neural MOS prediction models, even state-of-the-art models with task-specific fine-tuning. Our ablation study shows PLDA training superiority over SSL model fine-tuning in a low-resource scenario. We also improve SSL model fine-tuning using a convenient optimizer choice and additional contrastive and multi-task training objectives. The fine-tuned MooseNet NN with the PLDA module achieves the best results, surpassing the SSL baseline on the VoiceMOS Challenge data.
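
To make the idea of a generative model on top of fixed embeddings more concrete, here is a minimal sketch. It is not the authors' implementation and it is not full PLDA: it swaps in a simpler shared-covariance Gaussian classifier (one mean per rounded score, one within-class covariance) and predicts a score as the posterior-weighted average of the class labels. The embeddings are synthetic toy data standing in for SSL embeddings, and all function names (`fit_shared_cov_gaussians`, `predict_expected_score`, `make_toy_data`) are hypothetical.

```python
import numpy as np


def fit_shared_cov_gaussians(X, y, reg=1e-3):
    """Fit one mean per class and a single shared within-class covariance.

    Simplified stand-in for PLDA: there is no between-class prior here.
    """
    classes = np.unique(y)
    means = np.stack([X[y == c].mean(axis=0) for c in classes])
    # Center every sample on the mean of its own class.
    centered = X - means[np.searchsorted(classes, y)]
    cov = centered.T @ centered / len(X) + reg * np.eye(X.shape[1])
    return classes, means, np.linalg.inv(cov)


def predict_expected_score(X, classes, means, precision):
    """Posterior-weighted average of class labels, assuming a uniform prior."""
    diff = X[:, None, :] - means[None, :, :]  # shape (n, k, d)
    log_lik = -0.5 * np.einsum("nkd,de,nke->nk", diff, precision, diff)
    log_lik -= log_lik.max(axis=1, keepdims=True)  # numerical stability
    post = np.exp(log_lik)
    post /= post.sum(axis=1, keepdims=True)
    return post @ classes.astype(float)


def make_toy_data(n_per_class=30, dim=8, seed=0):
    """Synthetic 'embeddings' whose position along one axis tracks the score."""
    rng = np.random.default_rng(seed)
    direction = rng.normal(size=dim)
    direction /= np.linalg.norm(direction)
    X, y = [], []
    for score in range(1, 6):  # integer scores 1..5, as in a MOS scale
        noise = rng.normal(scale=0.5, size=(n_per_class, dim))
        X.append(3.0 * score * direction + noise)
        y.append(np.full(n_per_class, score))
    return np.concatenate(X), np.concatenate(y)


if __name__ == "__main__":
    X, y = make_toy_data()
    classes, means, precision = fit_shared_cov_gaussians(X, y)
    pred = predict_expected_score(X, classes, means, precision)
    print("correlation with true scores:", np.corrcoef(pred, y)[0, 1])
```

Fitting such a model needs only class means and one covariance matrix, which is one plausible reason a generative back end can be trained quickly on a small number of utterances. The toy data says nothing about the paper's reported results.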

Code Implementations: 1 repo