AS · CL · SD · Aug 25, 2023

Decoupled Structure for Improved Adaptability of End-to-End Models

arXiv:2308.13345v1 · 7 citations · h-index: 64
Originality: Incremental advance
AI Analysis

This paper addresses the domain adaptation challenge for speech recognition systems, enabling improved performance in new domains without retraining. The advance is incremental because it builds on existing E2E models.

The paper tackled the problem of domain shifts in end-to-end automatic speech recognition models by proposing decoupled structures for attention-based encoder-decoder and neural transducer models, which allowed flexible domain adaptation using text-only data and achieved relative word error rate reductions of 15.1% and 17.2% on target-domain corpora while maintaining intra-domain performance.

Although end-to-end (E2E) trainable automatic speech recognition (ASR) has shown great success by jointly learning acoustic and linguistic information, it still suffers from the effect of domain shifts, thus limiting potential applications. The E2E ASR model implicitly learns an internal language model (LM) which characterises the training distribution of the source domain, and the E2E trainable nature makes the internal LM difficult to adapt to the target domain with text-only data. To solve this problem, this paper proposes decoupled structures for attention-based encoder-decoder (Decoupled-AED) and neural transducer (Decoupled-Transducer) models, which can achieve flexible domain adaptation in both offline and online scenarios while maintaining robust intra-domain performance. To this end, the acoustic and linguistic parts of the E2E model decoder (or prediction network) are decoupled, making the linguistic component (i.e. internal LM) replaceable. When encountering a domain shift, the internal LM can be directly replaced during inference by a target-domain LM, without re-training or using domain-specific paired speech-text data. Experiments for E2E ASR models trained on the LibriSpeech-100h corpus showed that the proposed decoupled structure gave 15.1% and 17.2% relative word error rate reductions on the TED-LIUM 2 and AESRC2020 corpora while still maintaining performance on intra-domain data.
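
The idea of a replaceable linguistic component can be illustrated with a toy sketch. This is not the paper's architecture: the class `DecoupledDecoder`, the helper `make_unigram_lm`, the two-word vocabulary, and the simple sum of acoustic and LM log-probabilities are all assumptions made for illustration. The sketch only shows that, once the LM is a separate component, it can be swapped at inference time without touching the acoustic side.

```python
import math
from typing import Callable, Dict, List, Sequence

# Hypothetical types: an LM maps a token history to log-probabilities per token.
LogProbs = Dict[str, float]
LanguageModel = Callable[[Sequence[str]], LogProbs]


def make_unigram_lm(counts: Dict[str, int]) -> LanguageModel:
    """Build a toy unigram LM from text-only counts (ignores the history)."""
    total = sum(counts.values())
    table = {tok: math.log(c / total) for tok, c in counts.items()}

    def lm(history: Sequence[str]) -> LogProbs:
        return table

    return lm


class DecoupledDecoder:
    """Toy decoder whose linguistic component is a swappable attribute.

    Assumption: acoustic and LM scores are combined by adding log-probabilities.
    The paper's actual combination inside the network may differ.
    """

    def __init__(self, lm: LanguageModel) -> None:
        self.lm = lm

    def step(self, acoustic: LogProbs, history: Sequence[str]) -> str:
        lm_scores = self.lm(history)
        combined = {tok: acoustic[tok] + lm_scores[tok] for tok in acoustic}
        return max(combined, key=combined.get)

    def greedy_decode(self, frames: List[LogProbs]) -> List[str]:
        history: List[str] = []
        for acoustic in frames:
            history.append(self.step(acoustic, history))
        return history


# Source-domain text favours "caught"; target-domain text favours "court".
source_lm = make_unigram_lm({"caught": 8, "court": 2})
target_lm = make_unigram_lm({"caught": 2, "court": 8})

# One acoustically ambiguous step, slightly favouring "caught".
frames = [{"court": math.log(0.45), "caught": math.log(0.55)}]

decoder = DecoupledDecoder(source_lm)
source_result = decoder.greedy_decode(frames)

# Domain shift: replace only the LM, using text-only statistics.
decoder.lm = target_lm
target_result = decoder.greedy_decode(frames)
```

With the source-domain LM the ambiguous step decodes as "caught"; after the swap it decodes as "court", even though the acoustic scores are unchanged.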

