SD AI LG · Jun 2, 2025

SALF-MOS: Speaker Agnostic Latent Features Downsampled for MOS Prediction

arXiv:2506.02082v1 · 1 citation · h-index: 7 · SPCOM
Originality: Incremental advance
AI Analysis

This work addresses the need for efficient and reliable speech quality evaluation in TTS and voice conversion, reducing the manual effort that subjective methods require. It appears incremental, however, since it builds on existing MOS prediction approaches.

The paper tackles speech quality assessment for text-to-speech synthesis by developing SALF-MOS, a speaker-agnostic model that predicts Mean Opinion Scores, and reports state-of-the-art results as measured by MSE, LCC, SRCC, and KTAU.

Speech quality assessment is a critical step in selecting text-to-speech synthesis (TTS) or voice conversion models. Voice synthesis can be evaluated using objective or subjective metrics. Although there are many objective metrics, such as Perceptual Evaluation of Speech Quality (PESQ), Perceptual Objective Listening Quality Assessment (POLQA), and Short-Time Objective Intelligibility (STOI), none of them is feasible for selecting the best model. Subjective metrics such as the Mean Opinion Score, on the other hand, are highly reliable but require a lot of manual effort and are time-consuming. To counter these issues in MOS evaluation, we have developed a novel model, Speaker Agnostic Latent Features (SALF)-Mean Opinion Score (MOS), which is a small-sized, end-to-end, highly generalized and scalable model for predicting MOS on a scale of 5. We stack sequences of convolutions to extract latent features from the audio samples and obtain state-of-the-art results based on mean squared error (MSE), Linear Concordance Correlation coefficient (LCC), Spearman Rank Correlation Coefficient (SRCC) and Kendall Rank Correlation Coefficient (KTAU).
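
The four evaluation metrics named in the abstract can be illustrated with a minimal sketch in plain Python. This is not the authors' evaluation code. It assumes that LCC is computed as the Pearson linear correlation (the way the abbreviation is commonly used in MOS prediction work), that KTAU is Kendall's tau-b, and that scores are compared per utterance. The function names and the example scores are hypothetical.

```python
import math


def mse(pred, true):
    """Mean squared error between predicted and reference MOS values."""
    return sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true)


def pearson(x, y):
    """Pearson linear correlation (assumed here to be what LCC denotes)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """SRCC: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_b(x, y):
    """KTAU (tau-b variant, an assumption): pairwise agreement with tie correction."""
    concordant = discordant = only_x_tied = only_y_tied = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both variables: excluded from every count
            if dx == 0:
                only_x_tied += 1
            elif dy == 0:
                only_y_tied += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied = concordant + discordant
    denom = math.sqrt((untied + only_x_tied) * (untied + only_y_tied))
    return (concordant - discordant) / denom


def evaluate(pred, true):
    """Collect all four metrics in one dictionary."""
    return {
        "MSE": mse(pred, true),
        "LCC": pearson(pred, true),
        "SRCC": spearman(pred, true),
        "KTAU": kendall_tau_b(pred, true),
    }


# Hypothetical listener MOS values and model predictions, for illustration only.
true_mos = [4.5, 3.0, 2.0, 4.0, 1.5]
pred_mos = [4.2, 3.1, 2.5, 3.8, 1.9]
scores = evaluate(pred_mos, true_mos)
```

In this toy example the predictions preserve the ranking of the reference scores exactly, so the two rank-based metrics reach their maximum even though MSE is non-zero. This is why MOS predictors are usually judged on both error and correlation.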

