CL SD AS Sep 26, 2023

Updated Corpora and Benchmarks for Long-Form Speech Recognition

Jennifer Drexler Fox, Desh Raj, Natalie Delworth, Quinn McNamara, Corey Miller, Migüel Jetté

arXiv:2309.15013v13.614 citations h-index: 15 Has Code

Originality: Synthesis-oriented

AI Analysis

This work addresses a practical problem for ASR researchers and developers by providing updated benchmarks to study and mitigate train-test mismatch in real-world applications, though it is incremental as it builds on existing corpora and methods.

The authors tackled the mismatch between segmented training data and unsegmented test audio in ASR by re-releasing three corpora with updated transcriptions and alignments for long-form research, and they found that attention-based encoder-decoders are more susceptible to this issue while simple long-form training improves model robustness.

The vast majority of ASR research uses corpora in which both the training and test data have been pre-segmented into utterances. In most real-world ASR use cases, however, test audio is not segmented, leading to a mismatch between inference-time conditions and models trained on segmented utterances. In this paper, we re-release three standard ASR corpora (TED-LIUM 3, GigaSpeech, and VoxPopuli-en) with updated transcriptions and alignments to enable their use for long-form ASR research. We use these reconstituted corpora to study the train-test mismatch problem for transducers and attention-based encoder-decoders (AEDs), confirming that AEDs are more susceptible to this issue. Finally, we benchmark a simple long-form training approach for these models, showing its efficacy for model robustness under this domain shift.
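
The abstract does not say how the long-form training examples are built. As a rough illustration only, the sketch below shows one common way to turn pre-segmented utterances into longer examples: greedily merging consecutive segments of the same recording up to a duration cap. The `Segment` type, the `merge_segments` function, and the 30-second default are hypothetical names and values chosen for this example. They are not taken from the paper or its released code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    """Hypothetical utterance record: one aligned span of a recording."""
    recording_id: str
    start: float  # seconds from the start of the recording
    end: float    # seconds from the start of the recording
    text: str


def merge_segments(segments: List[Segment], max_duration: float = 30.0) -> List[Segment]:
    """Greedily merge consecutive segments of a recording into longer chunks.

    A segment is appended to the current chunk when both belong to the same
    recording and the merged span stays within max_duration seconds.
    The 30 s default is an assumption for illustration.
    """
    ordered = sorted(segments, key=lambda s: (s.recording_id, s.start))
    merged: List[Segment] = []
    for seg in ordered:
        last = merged[-1] if merged else None
        if (
            last is not None
            and last.recording_id == seg.recording_id
            and seg.end - last.start <= max_duration
        ):
            # Extend the current chunk and join the transcripts with a space.
            merged[-1] = Segment(
                last.recording_id,
                last.start,
                max(last.end, seg.end),
                f"{last.text} {seg.text}",
            )
        else:
            # Start a new chunk (new recording, or the duration cap was hit).
            merged.append(seg)
    return merged


if __name__ == "__main__":
    demo = [
        Segment("talk1", 0.0, 10.0, "hello everyone"),
        Segment("talk1", 10.0, 20.0, "thanks for coming"),
        Segment("talk1", 20.0, 35.0, "let us begin"),
    ]
    for chunk in merge_segments(demo):
        print(chunk.recording_id, chunk.start, chunk.end, chunk.text)
```

Merging of this kind depends on transcripts and alignments that cover each recording contiguously, which is presumably why the updated transcriptions and alignments matter for long-form use.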
