CL LG SD

Jul 16, 2017

Listening while Speaking: Speech Chain by Deep Learning

Andros Tjandra, Sakriani Sakti, Satoshi Nakamura

arXiv:1707.04879v1

13.7174 citations

Originality: Incremental advance

AI Analysis

The paper addresses the disjointed development of speech processing systems, a problem relevant to researchers and practitioners in speech technology. Its integration approach is incremental in that it combines existing ASR and TTS methods.

The paper tackles the independent development of automatic speech recognition (ASR) and text-to-speech synthesis (TTS) by proposing a closed-loop speech chain model based on deep learning. The model integrates both systems to mimic human auditory feedback, and the authors report significant performance improvements over separate systems trained only on labeled data.

Despite the close relationship between speech perception and production, research in automatic speech recognition (ASR) and text-to-speech synthesis (TTS) has progressed more or less independently, with little mutual influence. In human communication, on the other hand, a closed-loop speech chain mechanism with auditory feedback from the speaker's mouth to her ear is crucial. In this paper, we take a step further and develop a closed-loop speech chain model based on deep learning. The sequence-to-sequence model in a closed-loop architecture allows us to train our model on the concatenation of both labeled and unlabeled data. While ASR transcribes the unlabeled speech features, TTS attempts to reconstruct the original speech waveform based on the text from ASR. In the opposite direction, ASR also attempts to reconstruct the original text transcription given the synthesized speech. To the best of our knowledge, this is the first deep learning model that integrates human speech perception and production behaviors. Our experimental results show that the proposed approach significantly improved performance over separate systems that were trained only with labeled data.
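The two reconstruction directions described in the abstract can be illustrated with a small Python sketch. This is not the authors' implementation: the names `speech_chain_losses`, `speech_loss`, `text_loss`, `toy_asr`, and `toy_tts` are hypothetical, speech is reduced to one float per frame instead of a waveform, and the toy loss functions stand in for whatever objectives the real sequence-to-sequence models use. The sketch only shows how labeled pairs, unlabeled speech, and unlabeled text each contribute a loss term.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Speech = List[float]  # assumption: one float per frame stands in for acoustic features
Text = str


def speech_loss(pred: Speech, target: Speech) -> float:
    """Squared error over aligned frames, plus a unit penalty per missing or extra frame."""
    n = max(len(pred), len(target), 1)
    squared = sum((p - t) ** 2 for p, t in zip(pred, target))
    return (squared + abs(len(pred) - len(target))) / n


def text_loss(pred: Text, target: Text) -> float:
    """Fraction of mismatched characters (toy proxy for a token-level loss)."""
    n = max(len(pred), len(target), 1)
    mismatches = sum(a != b for a, b in zip(pred, target))
    return (mismatches + abs(len(pred) - len(target))) / n


def speech_chain_losses(
    asr: Callable[[Speech], Text],
    tts: Callable[[Text], Speech],
    paired: Sequence[Tuple[Speech, Text]],
    unpaired_speech: Sequence[Speech],
    unpaired_text: Sequence[Text],
) -> Dict[str, float]:
    """Compute the three loss terms of a closed-loop ASR/TTS chain (toy version)."""
    # Labeled data: each model is scored directly against its ground truth.
    supervised = sum(text_loss(asr(x), y) + speech_loss(tts(y), x) for x, y in paired)
    # Unlabeled speech: speech -> ASR -> text -> TTS -> reconstructed speech.
    speech_recon = sum(speech_loss(tts(asr(x)), x) for x in unpaired_speech)
    # Unlabeled text: text -> TTS -> speech -> ASR -> reconstructed text.
    text_recon = sum(text_loss(asr(tts(y)), y) for y in unpaired_text)
    return {
        "supervised": supervised,
        "speech_reconstruction": speech_recon,
        "text_reconstruction": text_recon,
    }


def toy_asr(x: Speech) -> Text:
    """Hypothetical recognizer: maps frame value k to the k-th lowercase letter."""
    return "".join(chr(97 + int(round(f))) for f in x)


def toy_tts(y: Text) -> Speech:
    """Hypothetical synthesizer: exact inverse of toy_asr."""
    return [float(ord(c) - 97) for c in y]


losses = speech_chain_losses(
    toy_asr,
    toy_tts,
    paired=[([0.0, 1.0], "ab")],
    unpaired_speech=[[2.0, 3.0]],
    unpaired_text=["cd"],
)
```

Because the toy recognizer and synthesizer are exact inverses, every loss term is zero here. Swapping in an imperfect model makes the reconstruction terms positive, which is the signal a closed loop can exploit when no labels are available.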
