AS AI SD | Sep 24, 2025

Selective Classifier-free Guidance for Zero-shot Text-to-speech

arXiv:2509.19668v1
h-index: 10
Originality: Incremental advance
AI Analysis

The paper addresses a specific problem in speech synthesis, relevant to applications that require high-quality, personalized voice generation. The advance is incremental because it builds on existing CFG methods from image generation.

The paper tackles the challenge of balancing speaker fidelity and text adherence in zero-shot text-to-speech by adapting classifier-free guidance (CFG) strategies from image generation. It finds that CFG strategies effective in image generation generally fail to improve speech synthesis, but that applying standard CFG in early timesteps and switching to selective CFG in later timesteps improves speaker similarity while limiting degradation of text adherence.

In zero-shot text-to-speech, achieving a balance between fidelity to the target speaker and adherence to text content remains a challenge. While classifier-free guidance (CFG) strategies have shown promising results in image generation, their application to speech synthesis is underexplored. Separating the conditions used for CFG enables trade-offs between different desired characteristics in speech synthesis. In this paper, we evaluate the adaptability of CFG strategies originally developed for image generation to speech synthesis and extend separated-condition CFG approaches for this domain. Our results show that CFG strategies effective in image generation generally fail to improve speech synthesis. We also find that we can improve speaker similarity while limiting degradation of text adherence by applying standard CFG during early timesteps and switching to selective CFG only in later timesteps. Surprisingly, we observe that the effectiveness of a selective CFG strategy is highly dependent on the text representation, as differences between English and Mandarin can lead to different results even with the same model.
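The timestep schedule described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the model signature, the use of `None` to mean a dropped condition, the convention that `t` runs from 0 (start of sampling) to 1 (end), the `switch_point` and `scale` values, and in particular the form of "selective" CFG, which is assumed here to keep the text condition in the reference branch so that guidance acts only on the speaker condition.

```python
from typing import Callable, Optional

# Hypothetical model signature: model(x, t, text, speaker) -> prediction.
# Passing None for a condition stands in for "condition dropped".
Model = Callable[[float, float, Optional[str], Optional[str]], float]


def cfg_combine(cond: float, ref: float, scale: float) -> float:
    """Classifier-free guidance: extrapolate from ref toward cond."""
    return ref + scale * (cond - ref)


def scheduled_guidance(model: Model, x: float, t: float,
                       text: str, speaker: str,
                       scale: float = 2.0,
                       switch_point: float = 0.5) -> float:
    """Standard CFG before switch_point, selective CFG from it onward."""
    full = model(x, t, text, speaker)
    if t < switch_point:
        # Early steps: standard CFG, the reference drops both conditions.
        ref = model(x, t, None, None)
    else:
        # Later steps: selective CFG (assumed form), the reference keeps
        # the text and drops only the speaker condition.
        ref = model(x, t, text, None)
    return cfg_combine(full, ref, scale)


def toy_model(x: float, t: float,
              text: Optional[str], speaker: Optional[str]) -> float:
    """Stand-in for a real network: each condition adds a fixed offset."""
    return x + (1.0 if text else 0.0) + (10.0 if speaker else 0.0)


early = scheduled_guidance(toy_model, 0.0, 0.1, "hello", "spk")
late = scheduled_guidance(toy_model, 0.0, 0.9, "hello", "spk")
```

In this toy setting, the early step amplifies both the text and speaker offsets, while the late step amplifies only the speaker offset and leaves the text contribution unscaled, which mirrors the trade-off the abstract describes.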
