AS CL SD | Oct 28, 2022

Random Utterance Concatenation Based Data Augmentation for Improving Short-video Speech Recognition

Yist Y. Lin, Tao Han, Haihua Xu, Van Tung Pham, Yerbolat Khassanov, Tze Yuang Chong, Yi He, Lu Lu, Zejun Ma

arXiv:2210.15876v25.14 citations | h-index: 21

Originality: Synthesis-oriented

AI Analysis

This work addresses performance degradation in short-video speech recognition caused by train-test utterance length mismatch, offering an incremental improvement for this domain-specific task.

The paper tackles train-test utterance length mismatch in end-to-end automatic speech recognition for short-video speech. It proposes a random utterance concatenation data augmentation method that achieves a 5.72% word error rate reduction on average across 15 languages.

One limitation of the end-to-end automatic speech recognition (ASR) framework is that its performance is compromised when train and test utterance lengths are mismatched. In this paper, we propose an on-the-fly random utterance concatenation (RUC) based data augmentation method to alleviate the train-test utterance length mismatch issue for the short-video ASR task. Specifically, we are motivated by the observation that our human-transcribed training utterances tend to be much shorter for short-video spontaneous speech (~3 seconds on average), while our test utterances, generated by a voice activity detection front-end, are much longer (~10 seconds on average). Such a mismatch can lead to suboptimal performance. Empirically, the proposed RUC method significantly improves long-utterance recognition without a performance drop on short ones. Overall, it achieves a 5.72% word error rate reduction on average across 15 languages and improved robustness to various utterance lengths.
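The idea of on-the-fly random utterance concatenation can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `random_utterance_concat`, the parameters `max_concat` and `prob`, the use of plain sample lists for waveforms, and the choice to join transcripts with a single space are all assumptions made for illustration. The abstract does not state how many utterances are joined or how often.

```python
import random
from typing import List, Optional, Sequence, Tuple

# An utterance is a (waveform samples, transcript) pair.
Utterance = Tuple[List[float], str]


def random_utterance_concat(
    utterances: Sequence[Utterance],
    max_concat: int = 3,   # hypothetical upper bound on utterances per group
    prob: float = 0.5,     # hypothetical probability of concatenating a group
    rng: Optional[random.Random] = None,
) -> List[Utterance]:
    """Randomly join short utterances into longer ones (illustrative only)."""
    if max_concat < 2:
        raise ValueError("max_concat must be at least 2")
    rng = rng or random.Random()

    # Shuffle so that the concatenated neighbours are chosen at random.
    order = list(range(len(utterances)))
    rng.shuffle(order)

    augmented: List[Utterance] = []
    i = 0
    while i < len(order):
        # Either keep one utterance as-is or take a group of 2..max_concat.
        k = rng.randint(2, max_concat) if rng.random() < prob else 1
        group = [utterances[j] for j in order[i:i + k]]
        # Concatenate audio samples and transcripts in the same order.
        wave = [sample for w, _ in group for sample in w]
        text = " ".join(t for _, t in group)
        augmented.append((wave, text))
        i += k
    return augmented


if __name__ == "__main__":
    batch = [([0.1] * 3, "hello there"), ([0.2] * 2, "ok"), ([0.3] * 4, "see you")]
    for wave, text in random_utterance_concat(batch, rng=random.Random(0)):
        print(len(wave), repr(text))
```

Applying such a function to each freshly sampled batch, rather than once offline, is what would make the augmentation on-the-fly: the model sees different concatenations in every epoch, and some utterances stay short so that short-utterance performance is still trained for.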
