CL LG AS

Jan 8, 2023

SpeeChain: A Speech Toolkit for Large-Scale Machine Speech Chain

Heli Qi, Sashi Novitasari, Andros Tjandra, Sakriani Sakti, Satoshi Nakamura

arXiv:2301.02966v11.33 citations

h-index: 36

Has Code

Originality: Incremental advance

AI Analysis

This work addresses the need for efficient large-scale speech data augmentation tools for researchers and practitioners in speech recognition, though it is incremental as it builds on existing TTS-to-ASR concepts.

The paper tackles the challenge of scaling the TTS-to-ASR chain for speech processing by introducing SpeeChain, an open-source toolkit that improves word error rate (WER) in semi-supervised settings on the LibriSpeech train_clean_460 dataset.

This paper introduces SpeeChain, an open-source PyTorch-based toolkit designed to develop the machine speech chain for large-scale use. This first release focuses on the TTS-to-ASR chain, a core component of the machine speech chain, which refers to TTS data augmentation for ASR using unspoken text. To build an efficient pipeline for the large-scale TTS-to-ASR chain, we implement easy-to-use multi-GPU batch-level model inference, multi-dataloader batch generation, and on-the-fly data selection techniques. In this paper, we first explain the overall procedure of the TTS-to-ASR chain and the difficulties of each step. Then, we present a detailed ablation study on different types of unlabeled data, data filtering thresholds, batch composition, and real-synthetic data ratios. Our experimental results on train_clean_460 of LibriSpeech demonstrate that our TTS-to-ASR chain can significantly improve WER in a semi-supervised setting.
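The abstract names data filtering thresholds, batch composition, and real-synthetic data ratios as the knobs of the TTS-to-ASR chain. The following is a minimal sketch of how those three could interact, written in plain Python rather than against the SpeeChain API. Every name here (`Utterance`, `score`, `select_synthetic`, `mixed_batches`) is hypothetical, and the per-utterance quality score and the "keep if score >= threshold" rule are assumptions made for illustration, not the selection criterion used in the paper.

```python
import random
from dataclasses import dataclass
from typing import Iterator, List, Sequence


@dataclass
class Utterance:
    utt_id: str
    text: str
    is_synthetic: bool
    # Hypothetical quality score attached to TTS output (higher is better).
    score: float = 1.0


def select_synthetic(pool: Sequence[Utterance], threshold: float) -> List[Utterance]:
    """Keep only the synthetic utterances whose score meets the threshold."""
    return [u for u in pool if u.score >= threshold]


def mixed_batches(
    real: Sequence[Utterance],
    synthetic: Sequence[Utterance],
    batch_size: int,
    real_ratio: float,
    threshold: float,
    seed: int = 0,
) -> Iterator[List[Utterance]]:
    """Yield batches with a fixed share of real and filtered synthetic data."""
    if not 0.0 <= real_ratio <= 1.0:
        raise ValueError("real_ratio must be between 0 and 1")
    rng = random.Random(seed)
    n_real = round(batch_size * real_ratio)
    n_syn = batch_size - n_real
    kept = select_synthetic(synthetic, threshold)
    if n_real > 0 and not real:
        raise ValueError("no real utterances available")
    if n_syn > 0 and not kept:
        raise ValueError("no synthetic utterances pass the threshold")
    while True:
        # Sample with replacement so small pools can still fill a batch.
        batch = rng.choices(real, k=n_real) if n_real else []
        batch += rng.choices(kept, k=n_syn) if n_syn else []
        rng.shuffle(batch)
        yield batch
```

Calling `next(mixed_batches(real, synthetic, batch_size=4, real_ratio=0.5, threshold=0.5))` would return one batch of two real and two synthetic utterances, with every synthetic item scoring at least 0.5. A real pipeline would draw from two dataloaders and filter during training; this sketch filters once up front to keep the example short.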
