CL, SD, AS · May 26, 2025

GSA-TTS : Toward Zero-Shot Speech Synthesis based on Gradual Style Adaptor

arXiv:2505.19384v1
Originality: Highly original
AI Analysis

This work addresses speech generation for unseen speakers in text-to-speech systems. It is an incremental improvement built on a novel style-encoding strategy.

The paper tackles zero-shot speech synthesis by proposing GSA-TTS, a method that gradually encodes speaking styles from acoustic references, resulting in promising performance for unseen speakers in naturalness, speaker similarity, and intelligibility.

We present the gradual style adaptor TTS (GSA-TTS) with a novel style encoder that gradually encodes speaking styles from an acoustic reference for zero-shot speech synthesis. GSA first captures the local style of each semantic sound unit. Then the local styles are combined by self-attention to obtain a global style condition. This semantic and hierarchical encoding strategy provides a robust and rich style representation for an acoustic model. We test GSA-TTS on unseen speakers and obtain promising results regarding naturalness, speaker similarity, and intelligibility. Additionally, we explore the potential of GSA in terms of interpretability and controllability, which stems from its hierarchical structure.
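The two-stage encoding described in the abstract (a local style per semantic sound unit, then self-attention across units to form a global style) can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the pooling, attention configuration, or feature dimensions. The mean pooling inside each unit, the single attention head, the final averaging, the unit boundaries, and the random projection matrices below are all assumptions chosen to keep the example short.

```python
import math
import random


def mean_pool(vectors):
    """Average a list of equal-length vectors element-wise."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def local_styles(frames, boundaries):
    """Stage 1 (assumed): one style vector per unit, by mean-pooling its frames."""
    return [mean_pool(frames[start:end]) for start, end in boundaries]


def project(vector, matrix):
    """Multiply a row vector (d,) by a matrix (d x d_out)."""
    return [
        sum(v * row[j] for v, row in zip(vector, matrix))
        for j in range(len(matrix[0]))
    ]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def global_style(local, wq, wk, wv):
    """Stage 2 (assumed): single-head self-attention over units, then averaging."""
    queries = [project(u, wq) for u in local]
    keys = [project(u, wk) for u in local]
    values = [project(u, wv) for u in local]
    scale = math.sqrt(len(keys[0]))
    # Each row holds one unit's attention weights over all units.
    weights = [
        softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        for q in queries
    ]
    attended = [
        [sum(w * v[j] for w, v in zip(row, values)) for j in range(len(values[0]))]
        for row in weights
    ]
    return mean_pool(attended), weights


random.seed(0)
dim = 4
# Hypothetical reference: 12 acoustic frames of dimension 4, split into 3 units.
frames = [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(12)]
boundaries = [(0, 3), (3, 7), (7, 12)]
# Random projections stand in for learned parameters.
wq, wk, wv = (
    [[random.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(dim)]
    for _ in range(3)
)

local = local_styles(frames, boundaries)
style, weights = global_style(local, wq, wk, wv)
```

Under these assumptions, `weights` shows how strongly each unit attends to the others. Inspecting or editing the per-unit vectors in `local` before the attention step is one way to picture the interpretability and controllability the abstract attributes to the hierarchical structure.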

