Analyzing and Improving Speaker Similarity Assessment for Speech Synthesis
This work addresses the challenge of assessing speaker identity consistency in voice cloning systems, though its contribution is incremental, centering on the dynamic aspects of a voice.
The paper tackles the problem of assessing speaker similarity in speech synthesis by analyzing automatic speaker verification (ASV) embeddings. It finds that these embeddings focus on static features such as timbre and neglect dynamic elements such as rhythm. It proposes U3D, a metric that evaluates dynamic rhythm patterns, and reports results showing improved characterization of identity.
Modeling voice identity is challenging due to its multifaceted nature. In generative speech systems, identity is often assessed using automatic speaker verification (ASV) embeddings, which are designed for discriminating between speakers rather than for characterizing identity. This paper investigates which aspects of a voice are captured in such representations. We find that widely used ASV embeddings focus mainly on static features like timbre and pitch range, while neglecting dynamic elements such as rhythm. We also identify confounding factors that compromise speaker similarity measurements and suggest mitigation strategies. To address these gaps, we propose U3D, a metric that evaluates speakers' dynamic rhythm patterns. This work contributes to the ongoing challenge of assessing speaker identity consistency in the context of ever-better voice cloning systems. We publicly release our code.
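As a point of reference for the ASV-based assessment described above, the following is a minimal sketch of how a speaker similarity score is commonly computed from two embeddings. It assumes cosine similarity between fixed-size vectors, which the text does not specify. The function name `cosine_similarity` and the toy vectors are hypothetical illustrations, not taken from the paper or its released code.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


# Toy stand-ins for ASV embeddings (hypothetical values, assumed for
# illustration; real embeddings typically have hundreds of dimensions).
reference = [0.2, 0.7, 0.1]       # embedding of the target speaker's recording
synthesized = [0.25, 0.65, 0.05]  # embedding of the cloned utterance

score = cosine_similarity(reference, synthesized)
print(f"speaker similarity: {score:.3f}")
```

A score near 1 means the two embeddings point in nearly the same direction. What that agreement reflects depends on which aspects of a voice the embedding captures, which is the question the paper examines.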