AS · CL · LG · SD · Feb 14, 2020

Unsupervised Speaker Adaptation using Attention-based Speaker Memory for End-to-End ASR

arXiv:2002.06165v1 · 39 citations
AI Analysis

This paper addresses speaker variability in ASR systems with an unsupervised adaptation method that improves performance in multi-speaker scenarios without needing auxiliary systems at test time. The contribution is incremental, since it builds on existing i-vector and attention mechanisms.

The paper tackles speaker adaptation in end-to-end ASR by proposing an unsupervised method in which an attention-based memory block stores and retrieves speaker i-vectors. The method achieves word error rates similar to those of i-vectors on single-speaker utterances and significantly lower WERs on utterances with speaker changes.

We propose an unsupervised speaker adaptation method inspired by the neural Turing machine for end-to-end (E2E) automatic speech recognition (ASR). The proposed model contains a memory block that holds speaker i-vectors extracted from the training data and reads relevant i-vectors from the memory through an attention mechanism. The resulting memory vector (M-vector) is concatenated to the acoustic features or to the hidden layer activations of an E2E neural network model. The E2E ASR system is based on the joint connectionist temporal classification and attention-based encoder-decoder architecture. M-vector and i-vector results are compared for inserting them at different layers of the encoder neural network using the WSJ and TED-LIUM2 ASR benchmarks. We show that M-vectors, which do not require an auxiliary speaker embedding extraction system at test time, achieve similar word error rates (WERs) compared to i-vectors for single speaker utterances and significantly lower WERs for utterances in which there are speaker changes.
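The memory read described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the scaled dot-product scoring, the linear query projection `w_query`, the function names, and all dimensions are assumptions chosen for illustration, and the random arrays stand in for real acoustic features and i-vectors. The sketch only shows the general shape of the idea: each frame attends over a fixed memory of training-speaker i-vectors, and the resulting M-vector is concatenated to that frame's features.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def read_m_vectors(features, memory, w_query):
    """Read one M-vector per frame from a memory of speaker i-vectors.

    features: (T, D) acoustic frames (or hidden activations).
    memory:   (N, K) i-vectors extracted from the training data.
    w_query:  (D, K) hypothetical learned projection from frames to queries.
    """
    queries = features @ w_query                              # (T, K)
    # Assumed scoring function: scaled dot product between query and i-vector.
    scores = queries @ memory.T / np.sqrt(memory.shape[1])    # (T, N)
    weights = softmax(scores, axis=-1)                        # rows sum to 1
    m_vectors = weights @ memory                              # (T, K)
    return m_vectors, weights


def append_m_vectors(features, m_vectors):
    """Concatenate each frame with its M-vector along the feature axis."""
    return np.concatenate([features, m_vectors], axis=-1)


# Illustrative sizes only; none of these values come from the paper.
rng = np.random.default_rng(0)
T, D, N, K = 50, 83, 8, 100
features = rng.standard_normal((T, D))
memory = rng.standard_normal((N, K))
w_query = rng.standard_normal((D, K)) * 0.1

m_vectors, weights = read_m_vectors(features, memory, w_query)
augmented = append_m_vectors(features, m_vectors)
```

Because the read in this sketch happens per frame, the attention weights can shift partway through an utterance. That is one plausible intuition for the reported gains on utterances with speaker changes, although the abstract does not state the granularity of the read.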
