AS, CL, SD | Oct 28, 2022

Random Utterance Concatenation Based Data Augmentation for Improving Short-video Speech Recognition

arXiv:2210.15876v24 citations | h-index: 21
AI Analysis

This paper addresses performance degradation in short-video speech recognition caused by train-test utterance length mismatch, offering an incremental improvement for this domain-specific task.

The paper tackles the problem of train-test utterance length mismatch in end-to-end automatic speech recognition for short-video speech. It proposes a random utterance concatenation data augmentation method that achieves a 5.72% word error rate reduction on average across 15 languages.

One limitation of the end-to-end automatic speech recognition (ASR) framework is that its performance degrades when train and test utterance lengths are mismatched. In this paper, we propose an on-the-fly random utterance concatenation (RUC) based data augmentation method to alleviate the train-test utterance length mismatch issue for the short-video ASR task. Specifically, we are motivated by the observation that our human-transcribed training utterances tend to be much shorter for short-video spontaneous speech (~3 seconds on average), while our test utterances, generated by a voice activity detection front-end, are much longer (~10 seconds on average). Such a mismatch can lead to suboptimal performance. Empirically, we observe that the proposed RUC method significantly improves long-utterance recognition without a performance drop on short ones. Overall, it achieves a 5.72% word error rate reduction on average across 15 languages and improved robustness to varying utterance lengths.
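The idea of on-the-fly utterance concatenation can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how many utterances are joined or how often, so the `Utterance` type and the `max_concat` and `concat_prob` parameters below are hypothetical choices made only for illustration. The sketch shuffles a batch of short training utterances and randomly joins neighbouring ones, concatenating both the waveform samples and the transcripts.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Utterance:
    samples: List[float]  # raw waveform samples
    text: str             # human transcript


def random_utterance_concat(
    batch: List[Utterance],
    max_concat: int = 3,       # hypothetical: upper bound on utterances per group
    concat_prob: float = 0.5,  # hypothetical: chance that a group is concatenated
    rng: Optional[random.Random] = None,
) -> List[Utterance]:
    """Randomly join short utterances into longer ones (illustrative sketch)."""
    rng = rng or random.Random()
    pool = list(batch)
    rng.shuffle(pool)  # random partners, re-drawn each time the batch is built

    augmented = []
    i = 0
    while i < len(pool):
        # Either keep a single utterance or join 2..max_concat of them.
        if max_concat >= 2 and rng.random() < concat_prob:
            n = rng.randint(2, max_concat)
        else:
            n = 1
        group = pool[i:i + n]  # may be shorter than n at the end of the pool
        i += len(group)

        # Concatenate audio and transcripts in the same order.
        samples = [s for utt in group for s in utt.samples]
        # Assumption: transcripts are whitespace-delimited; languages written
        # without spaces would need a different join.
        text = " ".join(utt.text for utt in group)
        augmented.append(Utterance(samples, text))
    return augmented
```

Because the grouping is drawn again every time a batch is built, the model sees a different mix of short and long utterances in each epoch, which is the sense in which the augmentation is "on the fly". Leaving some utterances unconcatenated keeps short inputs in the training distribution, in line with the reported goal of improving long utterances without hurting short ones.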

