CL | Oct 11, 2022

Streaming Punctuation for Long-form Dictation with Transformers

Microsoft
arXiv:2210.05756v2 | 6 citations | h-index: 13
Originality: Incremental advance
AI Analysis

This paper addresses segmentation and punctuation issues in long-form dictation for speech recognition systems. It is an incremental improvement over existing methods.

The paper tackles punctuation and segmentation problems in long-form dictation speech recognition by proposing a streaming approach using dynamic decoding windows, improving segmentation F0.5-score by 13.9% and achieving an average BLEU-score improvement of 0.66 for machine translation.

While speech recognition Word Error Rate (WER) has reached human parity for English, long-form dictation scenarios still suffer from segmentation and punctuation problems resulting from irregular pausing patterns or slow speakers. Transformer sequence tagging models are effective at capturing long bi-directional context, which is crucial for automatic punctuation. Automatic Speech Recognition (ASR) production systems, however, are constrained by real-time requirements, making it hard to incorporate the right context when making punctuation decisions. In this paper, we propose a streaming approach for punctuation or re-punctuation of ASR output using dynamic decoding windows and measure its impact on punctuation and segmentation accuracy across scenarios. The new system tackles over-segmentation issues, improving segmentation F0.5-score by 13.9%. Streaming punctuation achieves an average BLEU-score improvement of 0.66 for the downstream task of Machine Translation (MT).
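The abstract reports segmentation quality as an F0.5-score. As a reminder of what that metric weights, here is a minimal sketch of the standard F-beta formula, F_beta = (1 + beta^2) * P * R / (beta^2 * P + R), where beta = 0.5 counts precision more heavily than recall. The function name f_beta and the example precision/recall values are illustrative assumptions, not figures or code from the paper.

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """Standard F-beta score: weighted harmonic mean of precision and recall.

    beta < 1 weights precision more heavily; beta = 0.5 gives the F0.5-score.
    (Hypothetical helper for illustration; not taken from the paper.)
    """
    if not (0.0 <= precision <= 1.0 and 0.0 <= recall <= 1.0):
        raise ValueError("precision and recall must lie in [0, 1]")
    b2 = beta * beta
    denom = b2 * precision + recall
    if denom == 0.0:
        # Both precision and recall are zero, so the score is defined as 0.
        return 0.0
    return (1.0 + b2) * precision * recall / denom


# Illustrative values only: the same pair of numbers scores higher
# when the larger one is precision, because beta = 0.5 favours precision.
high_precision = f_beta(precision=0.8, recall=0.4)
high_recall = f_beta(precision=0.4, recall=0.8)
print(round(high_precision, 4), round(high_recall, 4))
```

A precision-weighted metric is a natural fit for the over-segmentation problem the abstract describes, since a spurious sentence boundary is a false positive and lowers precision.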

