SD | AI | Jan 29

Understanding Frechet Speech Distance for Synthetic Speech Quality Evaluation

arXiv:2601.21386v1
Has Code
Originality: Incremental advance
AI Analysis

This work addresses scalable, cost-efficient evaluation of synthetic speech quality for researchers and developers. It is an incremental advance because it builds on existing Fréchet Distance methods.

The paper tackles the challenge of objectively evaluating synthetic speech quality by comprehensively assessing Fréchet Speech Distance (FSD) and Speech Maximum Mean Discrepancy (SMMD) under varied embeddings and conditions. It finds that WavLM Base+ features provide the most stable alignment with human ratings, though these metrics cannot fully replace subjective evaluation.

Objective evaluation of synthetic speech quality remains a critical challenge. Human listening tests are the gold standard, but costly and impractical at scale. Fréchet Distance has emerged as a promising alternative, yet its reliability depends heavily on the choice of embeddings and experimental settings. In this work, we comprehensively evaluate Fréchet Speech Distance (FSD) and its variant Speech Maximum Mean Discrepancy (SMMD) under varied embeddings and conditions. We further incorporate human listening evaluations alongside TTS intelligibility and synthetic-trained ASR WER to validate the perceptual relevance of these metrics. Our findings show that WavLM Base+ features yield the most stable alignment with human ratings. While FSD and SMMD cannot fully replace subjective evaluation, we show that they can serve as complementary, cost-efficient, and reproducible measures, particularly useful when large-scale or direct listening assessments are infeasible. Code is available at https://github.com/kaen2891/FrechetSpeechDistance.
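For readers unfamiliar with the underlying quantity, the following is a minimal sketch of the standard Fréchet distance between two sets of embeddings, each modelled as a multivariate Gaussian. It is not taken from the paper's repository: the function name `frechet_distance` and the random arrays standing in for speech embeddings (for example, pooled WavLM Base+ features) are assumptions made for illustration only.

```python
import numpy as np


def frechet_distance(real_emb: np.ndarray, synth_emb: np.ndarray) -> float:
    """Frechet distance between Gaussians fitted to two (N, D) embedding sets.

    Hypothetical helper for illustration; expects D >= 2 and N > D.
    """
    mu_r = real_emb.mean(axis=0)
    mu_s = synth_emb.mean(axis=0)
    cov_r = np.cov(real_emb, rowvar=False)
    cov_s = np.cov(synth_emb, rowvar=False)

    # tr((cov_r cov_s)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_r @ cov_s, which are real and non-negative for
    # positive semi-definite inputs. Clip tiny negative values from round-off.
    eigvals = np.linalg.eigvals(cov_r @ cov_s)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_r - mu_s
    return float(diff @ diff + np.trace(cov_r) + np.trace(cov_s) - 2.0 * trace_sqrt)


# Placeholder data: random vectors stand in for real and synthetic
# utterance-level speech embeddings (an assumption, not the paper's data).
rng = np.random.default_rng(0)
real = rng.normal(size=(500, 8))
synth = rng.normal(loc=0.5, size=(500, 8))
print(frechet_distance(real, synth))
```

A lower value means the synthetic embedding distribution sits closer to the real one. Which embedding model produces the vectors is a separate choice, and the abstract above reports that this choice strongly affects how well the score tracks human ratings.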

