ASSD Jun 25, 2020

Streaming Transformer ASR with Blockwise Synchronous Beam Search

arXiv:2006.14941v4, 21 citations
AI Analysis

This work addresses the need for low-latency, real-time speech recognition in applications like voice assistants and transcription services, offering an incremental improvement over existing streaming methods.

The paper tackles the problem of running Transformer-based automatic speech recognition (ASR) in a streaming fashion. Standard Transformer attention needs the entire input sequence, so the authors propose a blockwise synchronous beam search algorithm with block boundary detection and reliability scoring. The resulting streaming Transformer ASR outperforms conventional online approaches such as MoChA, achieves comparable or superior performance to batch models and other streaming methods across multiple language tasks, and reduces response time.

The Transformer self-attention network has shown promising performance as an alternative to recurrent neural networks in end-to-end (E2E) automatic speech recognition (ASR) systems. However, the Transformer has a drawback in that the entire input sequence is required to compute both self-attention and source-target attention. In this paper, we propose a novel blockwise synchronous beam search algorithm based on blockwise processing of the encoder to perform streaming E2E Transformer ASR. In the beam search, encoded feature blocks are synchronously aligned using a block boundary detection technique, where a reliability score of each predicted hypothesis is evaluated based on the end-of-sequence and repeated tokens in the hypothesis. Evaluations on the HKUST and AISHELL-1 Mandarin, LibriSpeech English, and CSJ Japanese tasks show that the proposed streaming Transformer algorithm outperforms conventional online approaches, including monotonic chunkwise attention (MoChA), especially when using the knowledge distillation technique. An ablation study indicates that our streaming approach contributes to reducing the response time, and that the repetition criterion contributes significantly in certain tasks. Our streaming ASR models achieve comparable or superior performance to batch models and other streaming-based Transformer methods in all tasks considered.
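The block boundary detection idea in the abstract can be illustrated with a small toy. The sketch below is not the authors' implementation: it uses greedy decoding instead of a beam, a hard yes/no test instead of a reliability score, and a stand-in `toy_step` function instead of a Transformer decoder. The names `EOS`, `is_unreliable`, `stream_decode`, and `toy_step` are all hypothetical. It assumes that predicting an end-of-sequence token or an already-emitted token before the last block signals that the decoder has used up the encoder blocks seen so far, so decoding pauses until the next block arrives.

```python
from typing import Callable, List, Sequence

EOS = "<eos>"  # hypothetical end-of-sequence symbol

Blocks = Sequence[Sequence[str]]


def is_unreliable(prefix: Sequence[str], token: str) -> bool:
    """Assumed criterion: <eos> or a repeated token marks a block boundary."""
    return token == EOS or token in prefix


def stream_decode(
    blocks: Blocks,
    step: Callable[[Blocks, List[str]], str],
    max_len: int = 50,
) -> List[str]:
    """Greedy toy decoder that only sees blocks[:b] while processing block b."""
    hyp: List[str] = []
    for b in range(1, len(blocks) + 1):
        visible = blocks[:b]
        last_block = b == len(blocks)
        while len(hyp) < max_len:
            token = step(visible, hyp)
            if last_block:
                # No more audio is coming, so <eos> is trusted here.
                if token == EOS:
                    return hyp
                hyp.append(token)
            elif is_unreliable(hyp, token):
                # Boundary detected: discard the token, wait for the next block.
                break
            else:
                hyp.append(token)
    return hyp


def toy_step(visible: Blocks, hyp: List[str]) -> str:
    """Stand-in for a decoder: reads tokens straight out of the visible blocks."""
    flat = [tok for block in visible for tok in block]
    return flat[len(hyp)] if len(hyp) < len(flat) else EOS


if __name__ == "__main__":
    print(stream_decode([["a", "b"], ["c"], ["d", "e"]], toy_step))
```

With the three toy blocks above, the decoder emits `a` and `b`, pauses when `toy_step` returns `<eos>` at the end of the first block, and continues once the following blocks become visible. A real system would score several hypotheses per step, and, as the abstract notes, the repetition criterion matters only in certain tasks.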

