Adapter-Based Extension of Multi-Speaker Text-to-Speech Model for New Speakers
This work addresses the challenge of efficient speaker adaptation in TTS for applications requiring personalized voice synthesis, though the contribution is incremental, since it builds on existing parameter-efficient fine-tuning techniques.
The paper tackles the problem of adapting multi-speaker text-to-speech models to new speakers without degrading performance for existing ones, proposing an adapter-based method that achieves high-quality synthesis with reduced data requirements.
Fine-tuning is a popular method for adapting text-to-speech (TTS) models to new speakers. However, this approach has some challenges. Fine-tuning usually requires several hours of high-quality speech per speaker. There is also a risk that fine-tuning will negatively affect the quality of speech synthesis for previously learned speakers. In this paper, we propose an alternative approach to TTS adaptation based on parameter-efficient adapter modules. In the proposed approach, a few small adapter modules are added to the original network. The original weights are frozen, and only the adapters are fine-tuned on speech from the new speaker. This parameter-efficient fine-tuning approach produces a new model with a high level of parameter sharing with the original model. Our experiments on the LibriTTS, HiFi-TTS and VCTK datasets validate the effectiveness of the adapter-based method through objective and subjective metrics.
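The adapter idea described above can be illustrated with a minimal sketch. The abstract does not specify the adapter architecture, so the code assumes a common residual bottleneck design (down-projection, ReLU, up-projection, skip connection) with a zero-initialized up-projection. The class name `BottleneckAdapter`, the helper `shared_fraction`, and all dimensions are hypothetical, and plain Python lists stand in for a deep learning framework to keep the example self-contained.

```python
import random


def linear(x, weight, bias):
    """Apply y = W x + b to a single vector x (plain lists of floats)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


class BottleneckAdapter:
    """Hypothetical residual adapter: x + up(relu(down(x))).

    Assumption: this is a generic bottleneck adapter, not necessarily
    the exact module used in the paper.
    """

    def __init__(self, dim, bottleneck, seed=0):
        rng = random.Random(seed)
        # Down-projection from the model dimension to a small bottleneck.
        self.down_w = [[rng.gauss(0.0, 0.02) for _ in range(dim)]
                       for _ in range(bottleneck)]
        self.down_b = [0.0] * bottleneck
        # Zero-initialized up-projection: the adapter starts as an identity
        # map, so the adapted model initially matches the frozen base model.
        self.up_w = [[0.0] * bottleneck for _ in range(dim)]
        self.up_b = [0.0] * dim

    def __call__(self, x):
        hidden = [max(0.0, h) for h in linear(x, self.down_w, self.down_b)]
        delta = linear(hidden, self.up_w, self.up_b)
        return [xi + di for xi, di in zip(x, delta)]

    def num_params(self):
        dim, bottleneck = len(self.up_w), len(self.down_w)
        # Two weight matrices plus two bias vectors.
        return 2 * dim * bottleneck + dim + bottleneck


def shared_fraction(base_params, adapter_params):
    """Fraction of the adapted model's parameters shared with the base model."""
    return base_params / (base_params + adapter_params)


# Hypothetical sizes chosen only for illustration.
adapter = BottleneckAdapter(dim=8, bottleneck=2)
x = [0.5, -1.0, 2.0, 0.0, 1.5, -0.5, 0.25, 3.0]
y = adapter(x)  # equals x before any adapter training
```

In a setup like this, only the adapter's weights would be updated for a new speaker, while the base network's weights stay fixed. `shared_fraction` shows how the shared portion of the adapted model could be quantified once the real parameter counts are known.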