CL LG SD AS

Nov 9, 2019

Speaker Adaptation for Attention-Based End-to-End Speech Recognition

Zhong Meng, Yashesh Gaur, Jinyu Li, Yifan Gong

arXiv:1911.03762v11.638 citations

h-index: 54

Originality: Incremental advance

AI Analysis

This work addresses the problem of adapting speech recognition models to individual speakers with minimal data. It is rated incremental because it builds on existing attention-based encoder-decoder frameworks.

The paper tackles speaker adaptation for end-to-end speech recognition with limited data, proposing three regularization methods that improve word error rate over a speaker-independent model by up to 12.2% with supervised adaptation and 3.0% with unsupervised adaptation.
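To make the improvement figures concrete, here is a minimal sketch of how a word error rate improvement is typically computed, assuming the reported percentages are relative reductions (the summary does not say whether they are relative or absolute). The function name and the WER values are hypothetical and are not results from the paper.

```python
def relative_wer_improvement(baseline_wer, adapted_wer):
    """Relative WER reduction in percent: how much of the baseline error is removed."""
    return 100.0 * (baseline_wer - adapted_wer) / baseline_wer


# Hypothetical WERs (in percent), chosen only to illustrate the arithmetic.
baseline = 20.0   # speaker-independent model
adapted = 18.0    # speaker-adapted model
improvement = relative_wer_improvement(baseline, adapted)
print(improvement)  # 10.0
```

Under this reading, a 2-point absolute drop from a 20% baseline counts as a 10% relative improvement.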

We propose three regularization-based speaker adaptation approaches to adapt the attention-based encoder-decoder (AED) model with very limited adaptation data from target speakers for end-to-end automatic speech recognition. The first method is Kullback-Leibler divergence (KLD) regularization, in which the output distribution of a speaker-dependent (SD) AED is forced to be close to that of the speaker-independent (SI) model by adding a KLD regularization term to the adaptation criterion. To compensate for the asymmetric deficiency in KLD regularization, an adversarial speaker adaptation (ASA) method is proposed to regularize the deep-feature distribution of the SD AED through the adversarial learning of an auxiliary discriminator and the SD AED. The third approach is multi-task learning, in which an SD AED is trained to jointly perform the primary task of predicting a large number of output units and an auxiliary task of predicting a small number of output units to alleviate the target sparsity issue. Evaluated on a Microsoft short message dictation task, all three methods are highly effective in adapting the AED model, achieving up to 12.2% and 3.0% word error rate improvement over an SI AED trained from 3400 hours of data for supervised and unsupervised adaptation, respectively.
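The KLD regularization described in the abstract can be illustrated with a minimal per-token sketch. It assumes the common formulation in which the adaptation loss interpolates the cross-entropy on the adaptation label with the KL divergence from the SI output distribution to the SD one. The weight `rho`, the function names, and the toy logits are hypothetical; the abstract does not give the exact criterion or any hyperparameters, so this is not the authors' implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kld_regularized_loss(sd_logits, si_logits, target, rho):
    """Per-token loss: (1 - rho) * cross-entropy + rho * KL(SI || SD).

    rho is an assumed interpolation weight in [0, 1]:
    rho = 0 is plain fine-tuning, rho = 1 only keeps the SD model near the SI model.
    """
    p_sd = softmax(sd_logits)
    p_si = softmax(si_logits)
    cross_entropy = -math.log(p_sd[target])
    kld = sum(p * math.log(p / q) for p, q in zip(p_si, p_sd) if p > 0)
    return (1.0 - rho) * cross_entropy + rho * kld


# Toy logits over three output units (hypothetical values).
si_logits = [2.0, 0.5, -1.0]   # frozen speaker-independent model
sd_logits = [0.5, 2.0, -1.0]   # speaker-dependent model being adapted

loss_plain = kld_regularized_loss(sd_logits, si_logits, target=1, rho=0.0)
loss_regularized = kld_regularized_loss(sd_logits, si_logits, target=1, rho=0.5)
loss_identical = kld_regularized_loss(si_logits, si_logits, target=0, rho=1.0)
```

When the SD and SI outputs coincide, the KLD term vanishes, so the penalty only grows as the adapted model drifts away from the SI model on the limited adaptation data.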
