SD CL LG AS · Jul 2, 2025

Analyzing and Improving Speaker Similarity Assessment for Speech Synthesis

Marc-André Carbonneau, Benjamin van Niekerk, Hugo Seuté, Jean-Philippe Letendre, Herman Kamper, Julian Zaïdi

arXiv:2507.02176v1 · 2 citations · h-index: 29 · Has Code · 13th edition of the Speech Synthesis Workshop

Originality: Incremental advance

AI Analysis

This work addresses the challenge of assessing speaker identity consistency in voice cloning systems. Its advance is incremental: it extends existing similarity assessment to the dynamic aspects of a voice.

The paper tackles the problem of assessing speaker similarity in speech synthesis by analyzing automatic speaker verification (ASV) embeddings. It finds that these embeddings focus on static features like timbre and neglect dynamic elements such as rhythm. It proposes U3D, a metric that evaluates dynamic rhythm patterns, with results showing improved characterization of identity.

Modeling voice identity is challenging due to its multifaceted nature. In generative speech systems, identity is often assessed using automatic speaker verification (ASV) embeddings, designed for discrimination rather than characterizing identity. This paper investigates which aspects of a voice are captured in such representations. We find that widely used ASV embeddings focus mainly on static features like timbre and pitch range, while neglecting dynamic elements such as rhythm. We also identify confounding factors that compromise speaker similarity measurements and suggest mitigation strategies. To address these gaps, we propose U3D, a metric that evaluates speakers' dynamic rhythm patterns. This work contributes to the ongoing challenge of assessing speaker identity consistency in the context of ever-better voice cloning systems. We publicly release our code.
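As a rough illustration of the embedding-based assessment the abstract describes, the sketch below scores two utterances by the cosine similarity of their speaker embeddings. This is a minimal sketch under stated assumptions, not the authors' released code: cosine similarity is assumed here as the scoring function, the 192-dimensional vectors are random placeholders standing in for the output of a real ASV model, and the names `cosine_similarity`, `reference`, `synthesized`, and `unrelated` are hypothetical.

```python
import math
import random


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


# Placeholder embeddings (assumption): a real pipeline would obtain these
# from an ASV model applied to reference and synthesized audio.
rng = random.Random(0)
dim = 192  # hypothetical embedding size

reference = [rng.gauss(0.0, 1.0) for _ in range(dim)]
# Simulate a close match: the reference plus a small perturbation.
synthesized = [x + 0.1 * rng.gauss(0.0, 1.0) for x in reference]
# Simulate a different speaker: an independent random vector.
unrelated = [rng.gauss(0.0, 1.0) for _ in range(dim)]

same_score = cosine_similarity(reference, synthesized)
diff_score = cosine_similarity(reference, unrelated)

print(f"reference vs. synthesized: {same_score:.3f}")
print(f"reference vs. unrelated:   {diff_score:.3f}")
```

A score of this kind summarizes each utterance as a single vector, which is consistent with the abstract's finding that such measurements mainly reflect static features like timbre and pitch range and say little about rhythm.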
