AS CL SD
Oct 12, 2021

Word Order Does Not Matter For Speech Recognition

arXiv:2110.05994v2
2 citations
Originality: Incremental advance
AI Analysis

This work addresses weakly supervised speech recognition, but it is an incremental advance: it builds on existing methods to handle transcripts whose word order is unknown.

The paper tackles the problem of training automatic speech recognition systems on weakly supervised data where the word order in transcripts is unknown. It achieves word error rates of 2.3%/4.6% on the LibriSpeech test-clean/test-other sets, closely matching the supervised baseline.

In this paper, we study the training of automatic speech recognition systems in a weakly supervised setting where the order of words in the transcript labels of the audio training data is not known. We train a word-level acoustic model that aggregates the distribution of all output frames using the LogSumExp operation and uses a cross-entropy loss to match the ground-truth word distribution. Using the pseudo-labels generated by this model on the training set, we then train a letter-based acoustic model using the Connectionist Temporal Classification loss. Our system achieves 2.3%/4.6% word error rate on the test-clean/test-other subsets of LibriSpeech, which closely matches the supervised baseline's performance.
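
The word-level objective described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes the acoustic model emits one vector of word logits per frame, that LogSumExp pooling is applied over the time axis, and that the target is the normalized bag-of-words count of the transcript. The function names (`unordered_word_loss`, `bag_of_words_target`) are hypothetical.

```python
import numpy as np


def logsumexp(x, axis):
    """Numerically stable log(sum(exp(x))) along one axis."""
    m = np.max(x, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(x - m), axis=axis))


def bag_of_words_target(word_ids, vocab_size):
    """Normalized word counts of a transcript; word order is discarded.

    Assumption: the 'ground-truth words distribution' is count / total.
    """
    counts = np.bincount(word_ids, minlength=vocab_size).astype(np.float64)
    return counts / counts.sum()


def unordered_word_loss(frame_logits, word_ids):
    """Cross-entropy between pooled frame predictions and the word distribution.

    frame_logits: array of shape (T, V), one row of word logits per frame.
    word_ids:     1-D integer array of transcript words, in any order.
    """
    _, vocab_size = frame_logits.shape
    # Aggregate all frames into one score per word (LogSumExp over time).
    pooled = logsumexp(frame_logits, axis=0)            # shape (V,)
    # Normalize the pooled scores into a log-distribution over the vocabulary.
    log_pred = pooled - logsumexp(pooled, axis=0)       # shape (V,)
    target = bag_of_words_target(word_ids, vocab_size)  # shape (V,)
    return float(-np.sum(target * log_pred))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    logits = rng.normal(size=(5, 4))  # 5 frames, toy vocabulary of 4 words
    # The same words in a different order give the same loss.
    print(unordered_word_loss(logits, np.array([1, 2, 3])))
    print(unordered_word_loss(logits, np.array([3, 1, 2])))
```

Because the target is built from word counts alone, permuting the transcript leaves the loss unchanged. That is the property that lets training proceed without word order.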

