CL, LG, SD, AS | Nov 9, 2019

Speaker Adaptation for Attention-Based End-to-End Speech Recognition

arXiv:1911.03762v1 | 38 citations
Originality: Incremental advance
AI Analysis

This work addresses speaker adaptation: adapting a speech recognition model to an individual speaker with minimal data. The contribution is rated incremental because it builds on existing attention-based encoder-decoder frameworks.

The paper tackles speaker adaptation for end-to-end speech recognition with limited adaptation data. It proposes three regularization methods that improve word error rate over a speaker-independent model by up to 12.2% in the supervised setting and 3.0% in the unsupervised setting.

We propose three regularization-based speaker adaptation approaches to adapt the attention-based encoder-decoder (AED) model with very limited adaptation data from target speakers for end-to-end automatic speech recognition. The first method is Kullback-Leibler divergence (KLD) regularization, in which the output distribution of a speaker-dependent (SD) AED is forced to be close to that of the speaker-independent (SI) model by adding a KLD regularization term to the adaptation criterion. To compensate for the asymmetric deficiency in KLD regularization, an adversarial speaker adaptation (ASA) method is proposed to regularize the deep-feature distribution of the SD AED through the adversarial learning of an auxiliary discriminator and the SD AED. The third approach is multi-task learning, in which an SD AED is trained to jointly perform the primary task of predicting a large number of output units and an auxiliary task of predicting a small number of output units to alleviate the target sparsity issue. Evaluated on a Microsoft short message dictation task, all three methods are highly effective in adapting the AED model, achieving up to 12.2% and 3.0% word error rate improvement over an SI AED trained on 3400 hours of data for supervised and unsupervised adaptation, respectively.
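The KLD regularization described in the abstract can be illustrated with a minimal sketch. It assumes the common formulation in which adding a KLD term to the cross-entropy criterion is equivalent to training against an interpolation of the hard label and the SI model's posterior. The function names, the weight `rho`, and the toy per-token logits are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def kld_regularized_loss(sd_logits, si_logits, labels, rho=0.5):
    """Cross-entropy of the SD model against interpolated targets.

    sd_logits, si_logits: arrays of shape (tokens, vocab) from the
    speaker-dependent and speaker-independent models (hypothetical inputs).
    labels: integer array of shape (tokens,) with the adaptation targets.
    rho: regularization weight in [0, 1] (assumed name); 0 is plain
    adaptation, 1 ignores the labels and only matches the SI model.
    """
    vocab = sd_logits.shape[-1]
    one_hot = np.eye(vocab)[labels]
    si_probs = np.exp(log_softmax(si_logits))
    # Interpolated target: hard label mixed with the SI posterior.
    targets = (1.0 - rho) * one_hot + rho * si_probs
    per_token = -(targets * log_softmax(sd_logits)).sum(axis=-1)
    return float(per_token.mean())

# Toy usage: 4 output tokens over a 6-unit vocabulary.
rng = np.random.default_rng(0)
sd_logits = rng.normal(size=(4, 6))
si_logits = rng.normal(size=(4, 6))
labels = np.array([0, 3, 5, 1])
loss = kld_regularized_loss(sd_logits, si_logits, labels, rho=0.5)
```

A larger `rho` keeps the adapted model closer to the SI model, which is the point of the regularizer when only a small amount of adaptation data is available.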

