CL | AS | Oct 22, 2020

Developing Real-time Streaming Transformer Transducer for Speech Recognition on Large-scale Dataset

arXiv:2010.11395v3 | 208 citations
Originality: Incremental advance
AI Analysis

This work addresses real-time speech recognition for applications that require low latency. It is an incremental advance, building on existing Transformer and streaming techniques.

The authors tackle the high inference cost of Transformer models in speech recognition by developing a streamable Transformer Transducer that combines Transformer-XL with chunk-wise processing. In streaming scenarios the model outperforms RNN Transducer and other baselines, and its runtime cost and latency can be optimized with a small look-ahead.

Recently, Transformer-based end-to-end models have achieved great success in many areas, including speech recognition. However, compared to LSTM models, the heavy computational cost of the Transformer during inference is a key obstacle to its application. In this work, we explored the potential of Transformer Transducer (T-T) models for first-pass decoding with low latency and fast speed on a large-scale dataset. We combine the ideas of Transformer-XL and chunk-wise streaming processing to design a streamable Transformer Transducer model. We demonstrate that T-T outperforms the hybrid model, RNN Transducer (RNN-T), and the streamable Transformer attention-based encoder-decoder model in the streaming scenario. Furthermore, the runtime cost and latency can be optimized with a relatively small look-ahead.
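
To make the idea of chunk-wise streaming with a bounded look-ahead concrete, here is a minimal sketch of one common way to build a chunk-wise self-attention mask. It is an illustration under stated assumptions, not the authors' implementation: the function name `chunk_attention_mask` and the parameters `chunk_size` and `left_chunks` are hypothetical, and the abstract does not specify how the paper itself constructs its mask.

```python
import numpy as np

def chunk_attention_mask(num_frames, chunk_size, left_chunks):
    """Return a boolean mask where mask[i, j] is True if frame i may attend to frame j.

    Assumed scheme (hypothetical, for illustration only):
    - frames are grouped into consecutive chunks of `chunk_size` frames;
    - a frame sees every frame in its own chunk, so its look-ahead is
      at most chunk_size - 1 frames;
    - a frame also sees the `left_chunks` chunks immediately before it,
      which bounds the history kept during streaming inference.
    """
    chunk_idx = np.arange(num_frames) // chunk_size  # chunk id of each frame
    query_chunk = chunk_idx[:, None]                 # shape (num_frames, 1)
    key_chunk = chunk_idx[None, :]                   # shape (1, num_frames)
    # Allowed keys lie in the query's chunk or in the previous `left_chunks` chunks.
    return (key_chunk <= query_chunk) & (key_chunk >= query_chunk - left_chunks)

# Example: 6 frames, chunks of 2 frames, 1 chunk of left context.
mask = chunk_attention_mask(num_frames=6, chunk_size=2, left_chunks=1)
print(mask.astype(int))
```

With these example settings, frame 4 (in the third chunk) may attend to frames 2 through 5: its own chunk plus one chunk of history. Because no frame ever sees past the end of its own chunk, stacking more layers does not increase the look-ahead in this scheme, which is why a small chunk size keeps latency low.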

