Aug 28, 2023

Pruning Self-Attention for Zero-Shot Multi-Speaker Text-to-Speech

arXiv:2308.14909v1 | 2 citations | h-index: 24
Originality: Incremental advance
AI Analysis

The paper addresses the challenge of personalized speech generation with limited data from target speakers, but the advance is incremental because it builds on existing transformer-based TTS models.

The paper tackles the problem of out-of-domain generalization in zero-shot multi-speaker text-to-speech by proposing a pruning method for self-attention layers, which improves voice quality and speaker similarity as verified in evaluations.

For personalized speech generation, a neural text-to-speech (TTS) model must be successfully implemented with limited data from a target speaker. To this end, the baseline TTS model needs to generalize well to out-of-domain data (i.e., the target speaker's speech). However, approaches to this out-of-domain generalization problem in TTS have yet to be thoroughly studied. In this work, we propose an effective pruning method for transformers, known as sparse attention, to improve the TTS model's generalization abilities. In particular, we prune redundant connections in self-attention layers whose attention weights fall below a threshold. To flexibly determine the pruning strength when searching for the optimal degree of generalization, we also propose a new differentiable pruning method that allows the model to learn the thresholds automatically. Evaluations on zero-shot multi-speaker TTS verify the effectiveness of our method in terms of voice quality and speaker similarity.
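The threshold-based pruning described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `pruned_attention`, the renormalization of the surviving weights, and the sigmoid-based soft mask (with its `temperature` parameter) are all assumptions made for illustration. The hard mask zeroes attention weights below the threshold; the soft mask is one plausible smooth relaxation of that step, which a differentiable threshold would need.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def pruned_attention(q, k, v, threshold, temperature=None):
    """Single-head scaled dot-product attention with threshold pruning.

    Hypothetical helper, not the paper's code.
    temperature=None -> hard mask: weights below `threshold` are set to zero.
    temperature>0    -> soft sigmoid mask (an assumed relaxation) that is
                        smooth in `threshold`.
    """
    d = q.shape[-1]
    weights = softmax(q @ k.T / np.sqrt(d))
    if temperature is None:
        mask = (weights >= threshold).astype(weights.dtype)
    else:
        mask = 1.0 / (1.0 + np.exp(-(weights - threshold) / temperature))
    pruned = weights * mask
    # Renormalize surviving weights so each row sums to 1 again (assumption).
    denom = np.maximum(pruned.sum(axis=-1, keepdims=True), 1e-9)
    pruned = pruned / denom
    return pruned @ v, pruned

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((4, 8)) for _ in range(3))
out, w = pruned_attention(q, k, v, threshold=0.2)
print(w.round(3))  # entries below the threshold are pruned to zero
```

With four keys, the largest weight in each row is at least 0.25, so a threshold of 0.2 always leaves at least one connection per query. A larger threshold could prune an entire row, in which case this sketch returns a zero vector for that query.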
