CL SD ASMar 29, 2022

Improving Generalization of Deep Neural Network Acoustic Models with Length Perturbation and N-best Based Label Smoothing

Xiaodong Cui, George Saon, Tohru Nagano, Masayuki Suzuki, Takashi Fukuda, Brian Kingsbury, Gakuto Kurata

IBM

arXiv:2203.15176v10.88 citationsh-index: 52

Originality Incremental advance

AI Analysis

This work addresses overfitting in speech recognition models, offering incremental improvements for ASR systems.

The authors tackled overfitting in deep neural network acoustic models for automatic speech recognition by introducing length perturbation and n-best based label smoothing, which improved generalization and achieved state-of-the-art performance on the SWB300 dataset.

We introduce two techniques, length perturbation and n-best based label smoothing, to improve generalization of deep neural network (DNN) acoustic models for automatic speech recognition (ASR). Length perturbation is a data augmentation algorithm that randomly drops and inserts frames of an utterance to alter the length of the speech feature sequence. N-best based label smoothing randomly injects noise to ground truth labels during training in order to avoid overfitting, where the noisy labels are generated from n-best hypotheses. We evaluate these two techniques extensively on the 300-hour Switchboard (SWB300) dataset and an in-house 500-hour Japanese (JPN500) dataset using recurrent neural network transducer (RNNT) acoustic models for ASR. We show that both techniques improve the generalization of RNNT models individually and they can also be complementary. In particular, they yield good improvements over a strong SWB300 baseline and give state-of-art performance on SWB300 using RNNT models.

View on arXiv PDF

Similar