VAD-free Streaming Hybrid CTC/Attention ASR for Unsegmented Recording
This paper addresses the problem of handling long-form speech efficiently in ASR applications, though the contribution appears incremental because it builds on existing MoChA and CTC methods.
The paper tackles streaming automatic speech recognition on unsegmented long-form recordings without voice activity detection. Its block-synchronous decoding achieves accuracy comparable to label-synchronous decoding, and its VAD-free inference recognizes long-form speech robustly for up to a few hours.
In this work, we propose novel decoding algorithms to enable streaming automatic speech recognition (ASR) on unsegmented long-form recordings without voice activity detection (VAD), based on monotonic chunkwise attention (MoChA) with an auxiliary connectionist temporal classification (CTC) objective. We propose a block-synchronous beam search decoding to take advantage of efficient batched output-synchronous and low-latency input-synchronous searches. We also propose a VAD-free inference algorithm that leverages CTC probabilities to determine a suitable timing to reset the model states to tackle the vulnerability to long-form data. Experimental evaluations demonstrate that the block-synchronous decoding achieves comparable accuracy to the label-synchronous one. Moreover, the VAD-free inference can recognize long-form speech robustly for up to a few hours.
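The idea of using CTC probabilities to pick a reset timing can be illustrated with a minimal sketch. The abstract does not give the exact criterion, so the rule below is an assumption: a reset is triggered once the CTC blank probability stays above a threshold for a fixed number of consecutive frames. The function name `find_reset_points` and the parameters `blank_threshold` and `min_blank_frames` are hypothetical and are not taken from the paper.

```python
from typing import Iterable, List


def find_reset_points(
    blank_probs: Iterable[float],
    blank_threshold: float = 0.9,   # assumed value, not from the paper
    min_blank_frames: int = 5,      # assumed value, not from the paper
) -> List[int]:
    """Return frame indices at which the model states could be reset.

    Assumed rule: a frame is treated as non-speech when its CTC blank
    probability is at least `blank_threshold`. A reset point is emitted
    once per run of non-speech frames, at the frame where the run length
    first reaches `min_blank_frames`.
    """
    reset_points: List[int] = []
    run_length = 0  # number of consecutive non-speech frames seen so far
    for t, p_blank in enumerate(blank_probs):
        if p_blank >= blank_threshold:
            run_length += 1
            # Emit exactly once per run, when the run becomes long enough.
            if run_length == min_blank_frames:
                reset_points.append(t)
        else:
            run_length = 0
    return reset_points


if __name__ == "__main__":
    probs = [0.1, 0.95, 0.95, 0.95, 0.2, 0.99, 0.99, 0.99, 0.99, 0.3]
    print(find_reset_points(probs, blank_threshold=0.9, min_blank_frames=3))
```

In a streaming decoder, each returned index would mark a point where the encoder and decoder states are cleared before processing continues, which is how a CTC-driven reset could stand in for an external VAD. How the actual system selects and applies reset points is described in the paper itself, not here.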