LoRP-TTS: Low-Rank Personalized Text-To-Speech
This work addresses the challenge of creating diverse speech corpora for speech-related tasks, representing an incremental improvement over existing zero-shot methods.
The paper tackles the problem that zero-shot text-to-speech systems struggle with non-studio-quality speech samples. It uses Low-Rank Adaptation (LoRA) to enable personalization from a single noisy recording, improving speaker similarity by up to 30 percentage points while maintaining content and naturalness.
Speech synthesis models convert written text into natural-sounding audio. While earlier models were limited to a single speaker, recent advances have led to zero-shot systems that generate realistic speech for a wide range of speakers, using samples of their voices as additional prompts. However, these systems still struggle to imitate non-studio-quality samples that differ significantly from the training data. In this work, we demonstrate that Low-Rank Adaptation (LoRA) allows us to successfully use even single recordings of spontaneous speech in noisy environments as prompts. This approach improves speaker similarity by up to 30 percentage points (pp) while preserving content and naturalness. It represents a significant step toward creating truly diverse speech corpora, which are crucial in all speech-related tasks.
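
For readers unfamiliar with LoRA, the sketch below illustrates the general idea in plain Python: a frozen weight matrix W is adapted as W + (alpha / r) * B A, where only the small factors B and A are trained. This is the standard LoRA formulation, not the paper's implementation; the matrix sizes, rank, scaling value, and helper names (`matmul`, `lora_weight`, `count_params`) are hypothetical and chosen only for illustration.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A, the LoRA-adapted weight.

    w: frozen d_out x d_in matrix; b: d_out x r; a: r x d_in.
    """
    r = len(a)               # rank of the low-rank update
    scale = alpha / r
    delta = matmul(b, a)     # d_out x d_in update built from the two small factors
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

def count_params(d_out, d_in, r):
    """Trainable parameters: (full fine-tuning, LoRA factors only)."""
    return d_out * d_in, r * (d_out + d_in)

# Hypothetical sizes, for illustration only.
random.seed(0)
d_out, d_in, r = 4, 6, 2
w = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
a = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]
b = [[0.0] * r for _ in range(d_out)]  # zero init: the adapted model starts identical to the base

w_adapted = lora_weight(w, a, b, alpha=4.0)
full, lora = count_params(1024, 1024, 16)
```

Because B starts at zero, the adapted weights initially equal the frozen ones, so adaptation begins from the base model's behavior. For a hypothetical 1024 x 1024 layer at rank 16, the factors hold far fewer trainable parameters than the full matrix, which is one reason low-rank updates are commonly used when little adaptation data is available.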