A Data-Centric Framework for Addressing Phonetic and Prosodic Challenges in Russian Speech Generative Models
The work addresses speech synthesis problems for Russian-language users, but its contribution is incremental: it focuses on dataset creation rather than on a new method.
The paper tackles phonetic and prosodic challenges in Russian speech synthesis by introducing the Balalaika dataset, which contains over 2,000 hours of annotated speech, and shows that models trained on it significantly outperform those trained on existing datasets in synthesis and enhancement tasks.
Russian speech synthesis presents distinctive challenges, including vowel reduction, consonant devoicing, variable stress patterns, homograph ambiguity, and unnatural intonation. This paper introduces Balalaika, a novel dataset comprising more than 2,000 hours of studio-quality Russian speech with comprehensive textual annotations, including punctuation and stress markings. Experimental results show that models trained on Balalaika significantly outperform those trained on existing datasets in both speech synthesis and enhancement tasks. We detail the dataset construction pipeline, annotation methodology, and results of comparative evaluations.