SD CL AS Mar 29, 2022

Shifted Chunk Encoder for Transformer Based Streaming End-to-End ASR

arXiv:2203.15206v37.15 citations h-index: 12 Has Code

Originality: Incremental advance

AI Analysis

This work addresses the problem of limited global context in streaming ASR for real-time speech recognition applications, offering an incremental improvement over existing chunk-wise methods.

The authors tackle the challenge of improving global context modeling in streaming Transformer-based ASR while keeping computational complexity linear. They propose a shifted chunk mechanism; the resulting SChunk-Transformer and SChunk-Conformer achieve CERs of 6.43% and 5.77% on AISHELL-1, respectively.

Currently, there are three main kinds of Transformer-encoder-based streaming End-to-End (E2E) Automatic Speech Recognition (ASR) approaches: time-restricted methods, chunk-wise methods, and memory-based methods. Generally, all of them have limitations in terms of linear computational complexity, global context modeling, or parallel training. In this work, we aim to build a streaming Transformer ASR model that offers all three of these advantages. In particular, we propose a shifted chunk mechanism for the chunk-wise Transformer that provides cross-chunk connections. As a result, the global context modeling ability of chunk-wise models is significantly enhanced while all of their original merits are retained. We integrate this scheme into the chunk-wise Transformer and Conformer, and call the resulting models SChunk-Transformer and SChunk-Conformer, respectively. Experiments on AISHELL-1 show that the SChunk-Transformer and SChunk-Conformer achieve CERs of 6.43% and 5.77%, respectively. Their linear complexity makes it possible to train with large batches and to run inference more efficiently. Our models significantly outperform their conventional chunk-wise counterparts while remaining competitive with U2, which has quadratic complexity, with only a 0.22 absolute CER drop. They also achieve a better CER than existing chunk-wise or memory-based methods such as HS-DACS and MMA. Code is released.
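
The idea of cross-chunk connections can be illustrated with attention masks. The following is a minimal sketch, not the authors' released code: the helper names `chunk_ids` and `chunk_mask` are hypothetical, and shifting the chunk boundaries by half a chunk between consecutive layers is an assumption made by analogy with shifted-window attention, since the abstract does not state the shift size.

```python
import numpy as np

def chunk_ids(num_frames, chunk_size, shift=0):
    """Assign each frame to a chunk; `shift` offsets the chunk boundaries."""
    return (np.arange(num_frames) + shift) // chunk_size

def chunk_mask(num_frames, chunk_size, shift=0):
    """Boolean mask: mask[i, j] is True when frame i may attend to frame j."""
    ids = chunk_ids(num_frames, chunk_size, shift)
    return ids[:, None] == ids[None, :]

T, W = 16, 4  # number of frames and chunk size (illustrative values)

# Layer 1: regular chunks covering frames [0-3], [4-7], [8-11], [12-15].
regular = chunk_mask(T, W)

# Layer 2: boundaries moved by half a chunk (assumed shift), giving
# chunks [0-1], [2-5], [6-9], [10-13], [14-15].
shifted = chunk_mask(T, W, shift=W // 2)

# Frames whose information can reach frame i after a regular layer
# followed by a shifted layer.
two_layer = (shifted.astype(int) @ regular.astype(int)) > 0

# Number of attended (query, key) pairs in one regular layer is T * W,
# which grows linearly in T for a fixed chunk size.
pairs_per_layer = int(regular.sum())
```

In this sketch, frame 3 cannot attend to frame 4 in the regular layer because they sit in different chunks, but after stacking the shifted layer on top, information from frame 4 can reach frame 3. Each layer still attends only within chunks of at most `W` frames, so the cost per layer stays proportional to `T * W` rather than `T * T`.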
