SD LG AS
Apr 6, 2021

Comparing CTC and LFMMI for out-of-domain adaptation of wav2vec 2.0 acoustic model

arXiv:2104.02558v1
16 citations
Originality: Synthesis-oriented
AI Analysis

This work addresses performance gaps in speech recognition for low-resource or out-of-domain data, but it is incremental as it compares existing methods on new datasets.

The study adapts wav2vec 2.0 acoustic models for automatic speech recognition with limited training data, comparing CTC and E2E-LFMMI fine-tuning in out-of-domain and cross-lingual scenarios. Fine-tuning yields relative WER improvements of up to 64% over the supervised baseline.

In this work, we investigate whether wav2vec 2.0 self-supervised pretraining helps mitigate the overfitting issues of connectionist temporal classification (CTC) training and thereby reduces its performance gap with flat-start lattice-free MMI (E2E-LFMMI) for automatic speech recognition with limited training data. Towards that objective, we use the pretrained wav2vec 2.0 BASE model and fine-tune it on three different datasets, including out-of-domain (Switchboard) and cross-lingual (Babel) scenarios. Our results show that for supervised adaptation of the wav2vec 2.0 model, E2E-LFMMI and CTC achieve similar results, and both significantly outperform the baselines trained only with supervised data. Fine-tuning the wav2vec 2.0 model with E2E-LFMMI and CTC, we obtain the following relative WER improvements over the supervised baseline trained with E2E-LFMMI (figures are given for E2E-LFMMI and CTC, respectively). We get relative improvements of 40% and 44% on the clean-set and 64% and 58% on the test set of Librispeech (100h). On Switchboard (300h) we obtain relative improvements of 33% and 35%. Finally, for Babel languages, we obtain relative improvements of 26% and 23% on Swahili (38h) and 18% and 17% on Tagalog (84h).
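The "relative WER improvement" figures quoted above are percentage reductions in word error rate measured against the baseline's WER. A minimal sketch of that arithmetic follows, assuming the standard definitions of WER and relative improvement. The function names and the WER values in the usage note are hypothetical illustrations, not code or numbers from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER in percent: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return 100.0 * prev[-1] / len(ref)


def relative_wer_improvement(baseline_wer: float, adapted_wer: float) -> float:
    """Relative WER reduction in percent, measured against the baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (baseline_wer - adapted_wer) / baseline_wer
```

For example, with a hypothetical baseline WER of 10.0% and an adapted-model WER of 6.0%, `relative_wer_improvement(10.0, 6.0)` gives a 40% relative improvement, even though the absolute reduction is only 4 percentage points.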
