SD LG AS · Aug 28, 2023

Pruning Self-Attention for Zero-Shot Multi-Speaker Text-to-Speech

Hyungchan Yoon, Changhwan Kim, Eunwoo Song, Hyun-Wook Yoon, Hong-Goo Kang

arXiv:2308.14909v1 · 5.82 citations · h-index: 24

Originality: Incremental advance

AI Analysis

The paper addresses personalized speech generation when only limited data is available for the target speaker. The advance is incremental because it builds on existing transformer-based TTS models.

The paper tackles the problem of out-of-domain generalization in zero-shot multi-speaker text-to-speech by proposing a pruning method for self-attention layers, which improves voice quality and speaker similarity as verified in evaluations.

For personalized speech generation, a neural text-to-speech (TTS) model must be successfully implemented with limited data from a target speaker. To this end, the baseline TTS model needs to generalize well to out-of-domain data (i.e., the target speaker's speech). However, approaches to this out-of-domain generalization problem in TTS have yet to be thoroughly studied. In this work, we propose an effective pruning method for transformers, known as sparse attention, to improve the TTS model's generalization abilities. In particular, we prune redundant connections from self-attention layers, namely those whose attention weights fall below a threshold. To flexibly determine the pruning strength when searching for the optimal degree of generalization, we also propose a new differentiable pruning method that allows the model to learn the thresholds automatically. Evaluations on zero-shot multi-speaker TTS verify the effectiveness of our method in terms of voice quality and speaker similarity.
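
The thresholding idea in the abstract can be illustrated with a minimal sketch on a single row of attention weights. It is not the authors' implementation: the function names (`prune_hard`, `prune_soft`), the renormalization step, and the sigmoid gate used as a smooth stand-in for differentiable pruning are all assumptions made for illustration, and the abstract does not specify the paper's actual formulation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over one row of attention scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def prune_hard(weights, threshold):
    """Zero out attention weights below `threshold`, then renormalize.

    Assumption: renormalizing after pruning is our choice for this sketch.
    """
    kept = [w if w >= threshold else 0.0 for w in weights]
    total = sum(kept)
    if total == 0.0:
        # Everything was pruned: fall back to the unpruned weights.
        return list(weights)
    return [w / total for w in kept]


def prune_soft(weights, threshold, temperature=0.01):
    """Smooth relaxation of the hard rule (hypothetical, not from the paper).

    Each weight is scaled by sigmoid((w - threshold) / temperature), so the
    output varies smoothly with `threshold`. In an autodiff framework the
    threshold could then be treated as a trainable parameter.
    """
    gated = [
        w / (1.0 + math.exp(-(w - threshold) / temperature)) for w in weights
    ]
    total = sum(gated)
    return [g / total for g in gated]


# One query attending over four keys.
weights = softmax([2.0, 1.0, 0.1, -1.0])
hard = prune_hard(weights, threshold=0.1)
soft = prune_soft(weights, threshold=0.1)
```

With a threshold of 0.1, the hard rule removes the two weakest connections entirely and redistributes their mass to the remaining ones, while the soft rule only suppresses them. A smaller `temperature` moves the soft rule closer to the hard one.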

