AS · CL · LG · SD · ML · Apr 18, 2019

SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition

arXiv:1904.08779v3 · 4046 citations
Originality: Highly original
AI Analysis

The paper addresses the need for more accurate speech recognition in applications such as transcription, with incremental improvements over prior hybrid systems.

The paper introduces SpecAugment, a simple data augmentation method applied to the feature inputs of a speech recognition network. Without a language model, it achieves state-of-the-art performance: 6.8% WER on LibriSpeech test-other and 7.2%/14.6% WER on Switchboard/CallHome.

We present SpecAugment, a simple data augmentation method for speech recognition. SpecAugment is applied directly to the feature inputs of a neural network (i.e., filter bank coefficients). The augmentation policy consists of warping the features, masking blocks of frequency channels, and masking blocks of time steps. We apply SpecAugment on Listen, Attend and Spell networks for end-to-end speech recognition tasks. We achieve state-of-the-art performance on the LibriSpeech 960h and Switchboard 300h tasks, outperforming all prior work. On LibriSpeech, we achieve 6.8% WER on test-other without the use of a language model, and 5.8% WER with shallow fusion with a language model. This compares to the previous state-of-the-art hybrid system of 7.5% WER. For Switchboard, we achieve 7.2%/14.6% on the Switchboard/CallHome portion of the Hub5'00 test set without the use of a language model, and 6.8%/14.1% with shallow fusion, which compares to the previous state-of-the-art hybrid system at 8.3%/17.3% WER.
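The two masking steps of the augmentation policy described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `mask_features`, its parameters, the default mask widths, and the choice of zero as the fill value are all assumptions made for illustration, and the warping step is omitted. Features are assumed to be a list of time steps, each a list of filter bank coefficients.

```python
import random


def mask_features(features, num_freq_masks=1, num_time_masks=1,
                  max_freq_width=8, max_time_width=20, seed=None):
    """Return a copy of `features` with random frequency and time blocks zeroed.

    `features` is a list of time steps, each a list of per-channel values.
    All parameter names and defaults are hypothetical, not taken from the paper.
    """
    rng = random.Random(seed)
    augmented = [list(row) for row in features]  # leave the input untouched
    num_steps = len(augmented)
    num_channels = len(augmented[0]) if num_steps else 0

    # Frequency masking: zero a block of consecutive channels in every time step.
    for _ in range(num_freq_masks):
        width = rng.randint(0, min(max_freq_width, num_channels))
        start = rng.randint(0, num_channels - width)
        for row in augmented:
            for channel in range(start, start + width):
                row[channel] = 0.0

    # Time masking: zero a block of consecutive time steps across all channels.
    for _ in range(num_time_masks):
        width = rng.randint(0, min(max_time_width, num_steps))
        start = rng.randint(0, num_steps - width)
        for step in range(start, start + width):
            augmented[step] = [0.0] * num_channels

    return augmented


# Example: 100 time steps of 40 filter bank channels, all set to 1.0.
features = [[1.0] * 40 for _ in range(100)]
augmented = mask_features(features, seed=0)
```

Because the masks are drawn afresh on each call, applying the function to every training example at load time yields a different corrupted view of the same utterance in each epoch, with no extra audio data required.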

Code Implementations: 30 repos

Data from Papers with Code (CC-BY-SA-4.0)

