AS CL LG SD · Aug 11, 2020

Transformer with Bidirectional Decoder for Speech Recognition

arXiv:2008.04481v1 · 5 citations
AI Analysis

This work targets a limitation of end-to-end automatic speech recognition (ASR): transformer decoders that use only left-to-right context. The contribution is an incremental accuracy improvement for applications such as transcription.

Conventional transformer-based speech recognition models decode using only left-to-right context. The authors introduce a speech transformer with a bidirectional decoder (STBD) that uses left-to-right and right-to-left contexts simultaneously. On the AISHELL-1 dataset, STBD achieves a 3.6% relative character error rate (CER) reduction over the unidirectional baseline, and the largest model, STBD-Big, reaches 6.64% CER on the test set.
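The two metrics quoted above, CER and relative CER reduction, can be sketched as follows. This is a minimal illustration assuming the usual definitions (character-level edit distance divided by reference length, and the baseline-relative difference). The function names are hypothetical, and the numbers in the usage note are made up for illustration, not taken from the paper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitute, insert, delete)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edit distance normalised by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline_cer, new_cer):
    """Relative CER reduction (CERR) of a new system over a baseline."""
    return (baseline_cer - new_cer) / baseline_cer
```

For example, with illustrative values only, `relative_reduction(10.0, 9.0)` gives a 10% relative reduction even though the absolute difference is one percentage point, which is why the two figures should not be confused.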

Attention-based models have recently made tremendous progress on end-to-end automatic speech recognition (ASR). However, conventional transformer-based approaches usually generate the output sequence token by token from left to right, leaving right-to-left context unexploited. In this work, we introduce a bidirectional speech transformer that uses both directional contexts simultaneously. Specifically, the outputs of our proposed transformer include a left-to-right target and a right-to-left target. At inference, we use a bidirectional beam search method that generates both left-to-right and right-to-left candidates and selects the best hypothesis by score. To evaluate our proposed speech transformer with a bidirectional decoder (STBD), we conduct extensive experiments on the AISHELL-1 dataset. The results show that STBD achieves a 3.6% relative CER reduction (CERR) over the unidirectional speech transformer baseline. In addition, the strongest model in this paper, STBD-Big, achieves 6.64% CER on the test set without language model rescoring or any extra data augmentation strategies.
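The two-target setup and the final hypothesis selection described in the abstract can be sketched roughly as below. This is not the authors' implementation: the direction tags `<l2r>` and `<r2l>`, the `<eos>` token, the `Hypothesis` type, and the function names are all assumptions made for illustration. The beam search itself is omitted, and only the step of merging candidates from both directions and picking the highest score is shown.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical direction tags; the abstract does not name them.
L2R, R2L = "<l2r>", "<r2l>"


@dataclass
class Hypothesis:
    tokens: List[str]  # tokens in decoding order, without tag or <eos>
    score: float       # log-probability style score, higher is better
    direction: str     # L2R or R2L


def make_targets(tokens: List[str]) -> Tuple[List[str], List[str]]:
    """Build a left-to-right and a right-to-left training target."""
    l2r = [L2R] + list(tokens) + ["<eos>"]
    r2l = [R2L] + list(reversed(tokens)) + ["<eos>"]
    return l2r, r2l


def to_reading_order(hyp: Hypothesis) -> List[str]:
    """Undo the reversal for candidates decoded right to left."""
    if hyp.direction == R2L:
        return list(reversed(hyp.tokens))
    return list(hyp.tokens)


def select_best(candidates: List[Hypothesis]) -> Tuple[List[str], float]:
    """Pick the highest-scoring candidate across both directions."""
    if not candidates:
        raise ValueError("no candidates to choose from")
    best = max(candidates, key=lambda h: h.score)
    return to_reading_order(best), best.score
```

A usage sketch: given one left-to-right candidate scored -2.0 and one right-to-left candidate scored -1.5, `select_best` returns the right-to-left candidate flipped back into reading order. Whether and how scores are length-normalised before comparison is not stated in the abstract, so this sketch compares raw scores.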

