Developing Real-time Streaming Transformer Transducer for Speech Recognition on Large-scale Dataset
This work addresses real-time speech recognition for applications that require low latency. Its contribution is incremental, since it builds on existing Transformer and streaming techniques.
The authors tackled the high inference cost of Transformer models in speech recognition by developing a streamable Transformer Transducer that combines Transformer-XL with chunk-wise processing. In the streaming scenario it outperforms the hybrid model, the RNN Transducer, and a streamable Transformer attention-based encoder-decoder model, and its runtime cost and latency can be optimized with a small look-ahead.
Recently, Transformer-based end-to-end models have achieved great success in many areas, including speech recognition. However, compared to LSTM models, the heavy computational cost of the Transformer during inference is a key issue preventing its application. In this work, we explore the potential of Transformer Transducer (T-T) models for first-pass decoding with low latency and fast speed on a large-scale dataset. We combine the ideas of Transformer-XL and chunk-wise streaming processing to design a streamable Transformer Transducer model. We demonstrate that T-T outperforms the hybrid model, the RNN Transducer (RNN-T), and a streamable Transformer attention-based encoder-decoder model in the streaming scenario. Furthermore, the runtime cost and latency can be optimized with a relatively small look-ahead.
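To make the chunk-wise idea concrete, the following is a minimal sketch of one common way to restrict self-attention for streaming. It is not taken from the paper. The function name `chunk_attention_mask` and the parameters `chunk_size` and `left_chunks` are illustrative assumptions. Each frame may attend to every frame in its own chunk, which gives a look-ahead of at most `chunk_size - 1` frames, and to a fixed number of preceding chunks, whose states could be cached at inference time in the spirit of Transformer-XL.

```python
import numpy as np

def chunk_attention_mask(num_frames, chunk_size, left_chunks):
    """Boolean [num_frames, num_frames] mask (hypothetical helper).

    mask[i, j] is True when query frame i may attend to key frame j.
    """
    # Chunk index of every frame, e.g. [0, 0, 1, 1, 2, 2, ...] for chunk_size=2.
    chunk_idx = np.arange(num_frames) // chunk_size
    q = chunk_idx[:, None]  # chunk of each query frame, as a column
    k = chunk_idx[None, :]  # chunk of each key frame, as a row
    # Allow keys in the same chunk (bounded look-ahead) and in up to
    # `left_chunks` earlier chunks (bounded history).
    return (k <= q) & (k >= q - left_chunks)

mask = chunk_attention_mask(num_frames=8, chunk_size=2, left_chunks=1)
print(mask.astype(int))
```

With this kind of mask the look-ahead never extends past the end of the current chunk, even when layers are stacked, while the reachable history grows with depth. A smaller `chunk_size` lowers latency at the cost of less future context per frame.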