AS CL SD · Oct 12, 2021

Word Order Does Not Matter For Speech Recognition

Vineel Pratap, Qiantong Xu, Tatiana Likhomanenko, Gabriel Synnaeve, Ronan Collobert

arXiv:2110.05994v23.32 citations

Originality: Incremental advance

AI Analysis

This work addresses weakly supervised speech recognition, but the advance is incremental: it builds on existing methods to handle unordered transcripts.

The paper tackles training automatic speech recognition systems on weakly supervised data in which the word order of transcripts is unknown. It achieves 2.3%/4.6% word error rates on the LibriSpeech test-clean/test-other sets, closely matching the supervised baseline.

In this paper, we study the training of an automatic speech recognition system in a weakly supervised setting where the order of words in the transcript labels of the audio training data is not known. We train a word-level acoustic model which aggregates the distribution of all output frames using the LogSumExp operation and uses a cross-entropy loss to match the ground-truth word distribution. Using the pseudo-labels generated by this model on the training set, we then train a letter-based acoustic model using the Connectionist Temporal Classification loss. Our system achieves 2.3%/4.6% word error rate on the test-clean/test-other subsets of LibriSpeech, which closely matches the supervised baseline's performance.
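The word-level objective in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes the model emits one score per vocabulary word per frame, that scores are pooled over time with LogSumExp and then normalized over the vocabulary, and that the target is the normalized word count of the transcript. The names `logsumexp`, `bag_of_words_loss`, `frame_logits`, and `vocab` are hypothetical.

```python
import math
from collections import Counter


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def bag_of_words_loss(frame_logits, transcript_words, vocab):
    """Cross-entropy between a time-pooled word distribution and word counts.

    frame_logits: list of T frames, each a list of len(vocab) scores.
    transcript_words: words of the transcript, in any order.
    Assumption: every transcript word is in vocab (otherwise KeyError).
    """
    # Pool each word's scores over all frames with LogSumExp.
    pooled = [
        logsumexp([frame[w] for frame in frame_logits])
        for w in range(len(vocab))
    ]
    # Normalize the pooled scores over the vocabulary (log-softmax).
    log_norm = logsumexp(pooled)
    log_probs = [score - log_norm for score in pooled]

    # Target distribution: normalized word counts, which ignore word order.
    counts = Counter(transcript_words)
    total = sum(counts.values())
    index = {word: i for i, word in enumerate(vocab)}
    return -sum(
        (count / total) * log_probs[index[word]]
        for word, count in counts.items()
    )


# Toy example: 2 frames, 3-word vocabulary (made-up scores).
vocab = ["the", "cat", "sat"]
frame_logits = [[2.0, 0.1, -1.0], [0.0, 3.0, -0.5]]
loss_forward = bag_of_words_loss(frame_logits, ["the", "cat"], vocab)
loss_reversed = bag_of_words_loss(frame_logits, ["cat", "the"], vocab)
```

Because the target depends only on word counts, `loss_forward` and `loss_reversed` are equal: the loss cannot see word order, which is the weak-supervision setting the abstract describes.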
