CL · Oct 17, 2017

Embedding-Based Speaker Adaptive Training of Deep Neural Networks

Xiaodong Cui, Vaibhava Goel, George Saon

arXiv:1710.06937v1

Originality: Incremental advance

AI Analysis

This work addresses speaker adaptation in speech recognition. It proposes a method that improves recognition accuracy across speakers, although it is an incremental advance over existing adaptation techniques.

The paper tackles speaker variability in deep neural network (DNN) acoustic modeling by proposing an embedding-based speaker adaptive training approach, in which speaker embeddings drive transformations of the network's hidden-layer outputs. On large vocabulary continuous speech recognition tasks, it reports better performance than i-vector-based methods.

An embedding-based speaker adaptive training (SAT) approach is proposed and investigated in this paper for deep neural network acoustic modeling. In this approach, speaker embedding vectors, which are constant for a given speaker, are mapped through a control network to layer-dependent element-wise affine transformations that canonicalize the internal feature representations at the output of the hidden layers of a main network. The control network that generates the speaker-dependent mappings is estimated jointly with the main network for overall speaker adaptive acoustic modeling. Experiments on large vocabulary continuous speech recognition (LVCSR) tasks show that the proposed SAT scheme can outperform the widely used speaker-aware training that uses i-vectors with speaker-adapted input features.
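The mechanism described in the abstract can be illustrated with a minimal NumPy sketch of the forward pass. It assumes details the abstract does not give: a ReLU main network, a control network made of one linear map per hidden layer, a scale parameterized as `1 + z` so that a zero control output gives the identity transform, and arbitrary toy sizes. The function names (`init_params`, `speaker_transforms`, `forward`) are hypothetical. This is not the authors' architecture, only an instance of "speaker embedding → per-layer element-wise scale and shift applied to hidden-layer outputs".

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes (assumption: the paper's dimensions are not given here).
EMB_DIM, FEAT_DIM, HIDDEN_DIM, NUM_LAYERS, NUM_STATES = 8, 20, 16, 3, 10


def init_params():
    """Randomly initialize a main network and a control network (hypothetical layout)."""
    main, in_dim = [], FEAT_DIM
    for _ in range(NUM_LAYERS):
        main.append((rng.normal(0, 0.1, (in_dim, HIDDEN_DIM)), np.zeros(HIDDEN_DIM)))
        in_dim = HIDDEN_DIM
    out = (rng.normal(0, 0.1, (HIDDEN_DIM, NUM_STATES)), np.zeros(NUM_STATES))
    # Assumed control network: one linear map per hidden layer, emitting [scale | shift].
    control = [(rng.normal(0, 0.1, (EMB_DIM, 2 * HIDDEN_DIM)), np.zeros(2 * HIDDEN_DIM))
               for _ in range(NUM_LAYERS)]
    return main, out, control


def speaker_transforms(control, emb):
    """Map one speaker embedding to a list of per-layer (scale, shift) vectors."""
    transforms = []
    for W, b in control:
        z = emb @ W + b
        scale = 1.0 + z[:HIDDEN_DIM]  # assumption: centered on the identity transform
        shift = z[HIDDEN_DIM:]
        transforms.append((scale, shift))
    return transforms


def forward(main, out, control, frames, emb):
    """Return log-posteriors for a batch of frames that all come from one speaker."""
    h = frames
    for (W, b), (scale, shift) in zip(main, speaker_transforms(control, emb)):
        h = np.maximum(h @ W + b, 0.0)  # ReLU hidden layer of the main network
        h = scale * h + shift           # element-wise affine transform of the layer output
    logits = h @ out[0] + out[1]
    logits -= logits.max(axis=1, keepdims=True)  # stabilize the log-softmax
    return logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))


if __name__ == "__main__":
    main, out, control = init_params()
    frames = rng.normal(size=(5, FEAT_DIM))  # 5 synthetic feature frames
    emb_a = rng.normal(size=EMB_DIM)         # synthetic embedding for speaker A
    log_post = forward(main, out, control, frames, emb_a)
    print(log_post.shape)                                # (5, 10)
    print(np.allclose(np.exp(log_post).sum(axis=1), 1.0))  # True
```

Because the embedding is constant for a speaker, the per-layer transforms can be computed once per speaker and reused for all of that speaker's frames. Joint estimation, as the abstract describes it, would differentiate the training loss with respect to both the main-network and control-network parameters. In practice this would use an autodiff framework, and the sketch above omits it.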

