CLAS · Oct 23, 2019

Speech-XLNet: Unsupervised Acoustic Model Pretraining For Self-Attention Networks

arXiv:1910.10387v2 · 54 citations
Originality: Incremental advance
AI Analysis

This work addresses speech recognition accuracy and convergence speed for researchers and practitioners, but it is incremental as it adapts an existing pretraining method (XLNet) to the speech domain.

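The paper tackles the problem of improving self-attention networks (SANs) for speech recognition by introducing Speech-XLNet, an unsupervised pretraining scheme. Compared with a SAN trained from randomly initialized weights, the best systems yield relative improvements of 11.9% on TIMIT and 8.3% on WSJ, and reach a phone error rate (PER) of 13.3% on the TIMIT test set, which the authors report as, to their knowledge, the lowest PER obtained from a single system.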
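For readers unfamiliar with the metric, a relative improvement expresses the error-rate reduction as a fraction of the baseline error rate rather than as an absolute difference. The sketch below shows that arithmetic only; the function name and the numbers in the usage line are hypothetical and are not figures from the paper.

```python
def relative_improvement(baseline_error: float, new_error: float) -> float:
    """Return the relative error reduction, in percent, of new_error over baseline_error."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return 100.0 * (baseline_error - new_error) / baseline_error


# Hypothetical numbers, for illustration only (not results from the paper):
# an error rate that drops from 20.0% to 18.0% is a 2.0-point absolute gain.
example = relative_improvement(20.0, 18.0)
```

Under this definition, the same absolute gain counts for more when the baseline error is already low, which is why relative figures are commonly quoted alongside the absolute PER.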

Self-attention networks (SANs) can benefit significantly from bi-directional representation learning through unsupervised pretraining paradigms such as BERT and XLNet. In this paper, we present an XLNet-like pretraining scheme, "Speech-XLNet", for unsupervised acoustic model pretraining to learn speech representations with a SAN. The pretrained SAN is finetuned under the hybrid SAN/HMM framework. We conjecture that by shuffling the speech frame orders, the permutation in Speech-XLNet serves as a strong regularizer that encourages the SAN to make inferences by focusing on global structures through its attention weights. In addition, Speech-XLNet allows the model to explore bi-directional contexts for effective speech representation learning. Experiments on TIMIT and WSJ demonstrate that Speech-XLNet greatly improves SAN/HMM performance in terms of both convergence speed and recognition accuracy compared to a model trained from randomly initialized weights. Our best systems achieve relative improvements of 11.9% and 8.3% on the TIMIT and WSJ tasks, respectively. In particular, the best system achieves a phone error rate (PER) of 13.3% on the TIMIT test set, which, to the best of our knowledge, is the lowest PER obtained from a single system.
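The permutation idea in the abstract can be illustrated with a generic XLNet-style attention mask. This is a minimal sketch under stated assumptions, not the authors' implementation: it samples a random factorization order over frame indices and lets each frame attend only to frames that come earlier in that order. The function name `permutation_attention_mask` and its parameters are hypothetical.

```python
import random


def permutation_attention_mask(num_frames, seed=None):
    """Sample a frame order and build an XLNet-style attention mask.

    Returns (order, mask), where mask[i][j] is True when frame i may
    attend to frame j, i.e. when j precedes i in the sampled order.
    Assumption: this mirrors the generic XLNet masking rule; the paper's
    exact scheme may differ.
    """
    rng = random.Random(seed)
    order = list(range(num_frames))
    rng.shuffle(order)

    # rank[t] is the position of frame t within the sampled order.
    rank = [0] * num_frames
    for position, frame in enumerate(order):
        rank[frame] = position

    # A frame never attends to itself and never to frames later in the order.
    mask = [
        [rank[j] < rank[i] for j in range(num_frames)]
        for i in range(num_frames)
    ]
    return order, mask


order, mask = permutation_attention_mask(6, seed=0)
```

Because the order is random rather than left-to-right, a frame's visible context can include frames both before and after it in time, which is one way to read the abstract's remark that the scheme exposes bi-directional context.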

