SDAIAS Feb 11, 2025

LoRP-TTS: Low-Rank Personalized Text-To-Speech

arXiv:2502.07562v1, 3 citations, h-index: 1
Originality: Incremental advance
AI Analysis

This work addresses the challenge of creating diverse speech corpora for speech-related tasks. It is an incremental improvement over existing zero-shot text-to-speech methods.

The paper tackles the problem that zero-shot text-to-speech systems struggle with non-studio-quality speech samples. It uses Low-Rank Adaptation (LoRA) to personalize a model from a single noisy recording, improving speaker similarity by up to 30 percentage points while preserving content and naturalness.

Speech synthesis models convert written text into natural-sounding audio. While earlier models were limited to a single speaker, recent advancements have led to the development of zero-shot systems that generate realistic speech from a wide range of speakers using their voices as additional prompts. However, they still struggle with imitating non-studio-quality samples that differ significantly from the training datasets. In this work, we demonstrate that utilizing Low-Rank Adaptation (LoRA) allows us to successfully use even single recordings of spontaneous speech in noisy environments as prompts. This approach enhances speaker similarity by up to 30 percentage points (pp) while preserving content and naturalness. It represents a significant step toward creating truly diverse speech corpora, which are crucial in all speech-related tasks.
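For readers unfamiliar with LoRA, the following is a minimal, generic sketch of the low-rank update it relies on: a frozen weight matrix W is augmented with a trainable product B·A of rank r, scaled by alpha/r. It is not the paper's implementation. The class name `LoRALinear`, the layer sizes, the rank, and the scaling value are all illustrative assumptions, and this summary does not state which layers of the TTS model are adapted.

```python
import numpy as np

class LoRALinear:
    """Frozen linear layer plus a trainable low-rank update (generic LoRA sketch)."""

    def __init__(self, weight, rank=4, alpha=8.0, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = weight.shape
        self.weight = weight                                # frozen pretrained weights, (d_out, d_in)
        self.A = rng.normal(0.0, 0.01, size=(rank, d_in))   # trainable, small random init
        self.B = np.zeros((d_out, rank))                    # trainable, zero init: update starts at 0
        self.scale = alpha / rank                           # assumed scaling convention

    def delta(self):
        # Low-rank weight update; its rank is at most `rank`.
        return self.scale * (self.B @ self.A)

    def forward(self, x):
        # x has shape (batch, d_in); computes x (W + scale * B A)^T without forming the sum.
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T

    def trainable_params(self):
        return self.A.size + self.B.size


rng = np.random.default_rng(1)
W = rng.normal(size=(64, 128))       # hypothetical pretrained layer
layer = LoRALinear(W, rank=4)
x = rng.normal(size=(2, 128))

base_out = x @ W.T
lora_out = layer.forward(x)          # identical to base_out until B is trained

# Simulate a training step having changed B, then inspect the update.
layer.B = rng.normal(size=layer.B.shape)
update_rank = np.linalg.matrix_rank(layer.delta())
```

In this toy setup only A and B would be updated during personalization (768 values against 8192 in W), which is the usual motivation for applying LoRA when very little adaptation data is available, such as a single recording.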

