SD AI AS

May 24, 2023

Iteratively Improving Speech Recognition and Voice Conversion

Mayank Kumar Singh, Naoya Takahashi, Onoe Naoyuki

arXiv:2305.15055v1 2.3

Originality: Incremental advance

AI Analysis

This work addresses the problem of limited data for speech technologies in specific domains like singing and Hindi speech, offering an incremental improvement through iterative training.

The paper tackles the challenge of training high-quality automatic speech recognition (ASR) and voice conversion (VC) models in low-data domains. It proposes an iterative framework in which each model is used to improve the other, and reports improvements over baselines in English singing and Hindi speech evaluations.

Many existing works on voice conversion (VC) use automatic speech recognition (ASR) models to ensure linguistic consistency between source and converted samples. However, in low-data resource domains, training a high-quality ASR model remains a challenging task. In this work, we propose a novel iterative way of improving both the ASR and VC models. We first train an ASR model, which is used to ensure content preservation while training a VC model. In the next iteration, the VC model is used as a data augmentation method to further fine-tune the ASR model and generalize it to diverse speakers. By iteratively leveraging the improved ASR model to train the VC model and vice versa, we experimentally show improvement in both models. Our proposed framework outperforms the ASR and one-shot VC baseline models on English singing and Hindi speech domains in subjective and objective evaluations in low-data resource settings.
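The alternating loop described in the abstract can be sketched roughly as follows. This is only a schematic under stated assumptions, not the authors' implementation: the function name `iterate_asr_vc`, the callables `train_asr`, `train_vc`, and `convert`, the `num_iterations` parameter, and the choice to fine-tune the ASR model on the original data plus the converted copies are all hypothetical and are not specified in the abstract.

```python
from typing import Any, Callable, List, Optional, Sequence, Tuple

# (audio, transcript); the audio representation is deliberately left open.
Sample = Tuple[Any, str]


def iterate_asr_vc(
    data: List[Sample],
    speakers: Sequence[Any],
    train_asr: Callable[[List[Sample], Optional[Any]], Any],
    train_vc: Callable[[List[Sample], Any], Any],
    convert: Callable[[Any, Any, Any], Any],
    num_iterations: int = 2,
) -> Tuple[Any, Any]:
    """Alternate between VC training and ASR fine-tuning (hypothetical sketch).

    train_asr(samples, previous_asr) returns an ASR model (previous_asr is
    None for the initial training run).
    train_vc(samples, asr) returns a VC model trained with the ASR model as
    the content-preservation signal.
    convert(vc, audio, speaker) returns the audio rendered in the voice of
    the target speaker.
    """
    # Step 1: train an initial ASR model on the available low-resource data.
    asr = train_asr(data, None)
    vc = None

    for _ in range(num_iterations):
        # Step 2: train the VC model, using the current ASR model to
        # encourage linguistic consistency between source and converted audio.
        vc = train_vc(data, asr)

        # Step 3: use the VC model as data augmentation. Each utterance is
        # converted to every target speaker and keeps its transcript.
        # Assumption: the original samples are kept alongside the copies.
        augmented = list(data)
        for audio, text in data:
            for speaker in speakers:
                augmented.append((convert(vc, audio, speaker), text))

        # Step 4: fine-tune the ASR model on the augmented set, so that the
        # next VC training round starts from a stronger ASR model.
        asr = train_asr(augmented, asr)

    return asr, vc
```

Passing the training routines in as callables keeps the sketch independent of any particular ASR or VC architecture; only the order of the steps is taken from the abstract.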
