SD AI LG, Jun 2, 2025

SALF-MOS: Speaker Agnostic Latent Features Downsampled for MOS Prediction

Saurabh Agrawal, Raj Gohil, Gopal Kumar Agrawal, Vikram C M, Kushal Verma

arXiv:2506.02082v1, 4.01 citations, h-index: 7, SPCOM

Originality: Incremental advance

AI Analysis

SALF-MOS addresses the need for efficient and reliable speech quality evaluation in TTS and voice conversion, reducing the manual effort that subjective methods require. It appears incremental, however, because it builds on existing MOS prediction approaches.

The paper tackles the problem of speech quality assessment for text-to-speech synthesis by developing SALF-MOS, a speaker-agnostic model that predicts Mean Opinion Scores, achieving state-of-the-art results based on metrics like MSE, LCC, SRCC, and KTAU.

Speech quality assessment is a critical step in selecting text-to-speech synthesis (TTS) or voice conversion models. Voice synthesis can be evaluated using objective or subjective metrics. Although there are many objective metrics, such as Perceptual Evaluation of Speech Quality (PESQ), Perceptual Objective Listening Quality Assessment (POLQA), and Short-Time Objective Intelligibility (STOI), none of them is feasible for selecting the best model. On the other hand, a subjective metric like the Mean Opinion Score (MOS) is highly reliable, but it requires a lot of manual effort and is time-consuming. To counter these issues in MOS evaluation, we have developed a novel model, Speaker Agnostic Latent Features (SALF)-Mean Opinion Score (MOS), which is a small-sized, end-to-end, highly generalized, and scalable model for predicting the MOS on a scale of 5. We stack sequences of convolutions to extract latent features of the audio samples and obtain state-of-the-art results based on mean squared error (MSE), Linear Concordance Correlation Coefficient (LCC), Spearman Rank Correlation Coefficient (SRCC), and Kendall Rank Correlation Coefficient (KTAU).
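The four evaluation metrics named in the abstract can be illustrated with a minimal, standard-library-only sketch. This is not the authors' evaluation code. The score lists are made-up illustrative values, the function names are hypothetical, and LCC is computed here as the Pearson correlation, which is an assumption based on common practice in MOS prediction work rather than something the abstract states.

```python
from itertools import combinations
from math import sqrt


def mse(true, pred):
    """Mean squared error between listener MOS and predicted MOS."""
    return sum((t - p) ** 2 for t, p in zip(true, pred)) / len(true)


def pearson(x, y):
    """Pearson linear correlation (assumed here to be what LCC denotes)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """SRCC: Pearson correlation of the rank-transformed scores."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_b(x, y):
    """KTAU (tau-b variant): pairwise agreement in ordering, tie-corrected."""
    concordant = discordant = only_x_tied = only_y_tied = 0
    for i, j in combinations(range(len(x)), 2):
        dx, dy = x[i] - x[j], y[i] - y[j]
        if dx == 0 and dy == 0:
            continue  # tied in both lists: contributes to neither factor
        if dx == 0:
            only_x_tied += 1
        elif dy == 0:
            only_y_tied += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denom = sqrt((untied + only_x_tied) * (untied + only_y_tied))
    return (concordant - discordant) / denom


# Hypothetical scores on the 1-to-5 MOS scale, for illustration only.
true_mos = [4.5, 3.0, 2.0, 3.5, 1.5]
pred_mos = [4.0, 3.5, 2.5, 3.0, 1.5]

print("MSE :", mse(true_mos, pred_mos))
print("LCC :", pearson(true_mos, pred_mos))
print("SRCC:", spearman(true_mos, pred_mos))
print("KTAU:", kendall_tau_b(true_mos, pred_mos))
```

MSE penalizes absolute deviation from the listener scores, while the three correlation measures reward preserving the linear relationship (LCC) or the ranking (SRCC, KTAU) of systems or utterances, which is why MOS predictors are usually reported on all four.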
