AS, CL | Oct 29, 2022

BERT Meets CTC: New Formulation of End-to-End Speech Recognition with Pre-trained Masked Language Model

Yosuke Higuchi, Brian Yan, Siddhant Arora, Tetsuji Ogawa, Tetsunori Kobayashi, Shinji Watanabe

arXiv:2210.16663v240.0296 citations | h-index: 83

Originality: Highly original

AI Analysis

This work targets end-to-end speech recognition that must remain robust across diverse speaking styles and languages. It proposes a new formulation rather than an incremental improvement on existing methods.

The paper introduces BERT-CTC, which adapts BERT for connectionist temporal classification (CTC). The formulation relaxes CTC's conditional independence assumption and incorporates linguistic knowledge from BERT, improving over conventional approaches across variations in speaking styles and languages.

This paper presents BERT-CTC, a novel formulation of end-to-end speech recognition that adapts BERT for connectionist temporal classification (CTC). Our formulation relaxes the conditional independence assumptions used in conventional CTC and incorporates linguistic knowledge through the explicit output dependency obtained by BERT contextual embedding. BERT-CTC attends to the full contexts of the input and hypothesized output sequences via the self-attention mechanism. This mechanism encourages a model to learn inner/inter-dependencies between the audio and token representations while maintaining CTC's training efficiency. During inference, BERT-CTC combines a mask-predict algorithm with CTC decoding, which iteratively refines an output sequence. The experimental results reveal that BERT-CTC improves over conventional approaches across variations in speaking styles and languages. Finally, we show that the semantic representations in BERT-CTC are beneficial towards downstream spoken language understanding tasks.
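The two inference ingredients named in the abstract, CTC decoding and mask-predict, can be illustrated with a minimal Python sketch. This is not the authors' implementation: `ctc_collapse` is the standard CTC collapsing rule, while `mask_predict`, its linear re-masking schedule, and the stand-in model `toy_predict` are assumptions made purely for illustration. The sketch also works at the token level and does not model how BERT-CTC couples the two steps over audio frames.

```python
from typing import Callable, List, Sequence, Tuple

BLANK = "<blank>"  # CTC blank symbol (name chosen for this sketch)
MASK = "[MASK]"    # mask token, as in BERT's vocabulary

def ctc_collapse(frames: Sequence[str], blank: str = BLANK) -> List[str]:
    """Standard CTC collapse: merge consecutive repeats, then drop blanks."""
    out: List[str] = []
    prev = None
    for tok in frames:
        if tok != prev and tok != blank:
            out.append(tok)
        prev = tok
    return out

# A predictor maps a partially masked hypothesis to one
# (token, confidence) pair per position. Hypothetical interface.
Predictor = Callable[[List[str]], List[Tuple[str, float]]]

def mask_predict(length: int, predict: Predictor, iterations: int = 3) -> List[str]:
    """Iterative refinement: fill masked positions, then re-mask the
    least confident ones, masking fewer positions each iteration
    (linear schedule, assumed here)."""
    tokens = [MASK] * length
    scores = [0.0] * length
    for t in range(1, iterations + 1):
        preds = predict(tokens)
        # Only masked positions are updated; kept tokens stay fixed.
        for i, (tok, conf) in enumerate(preds):
            if tokens[i] == MASK:
                tokens[i], scores[i] = tok, conf
        n_mask = length * (iterations - t) // iterations
        if n_mask == 0:
            break
        # Re-mask the n_mask positions with the lowest confidence.
        for i in sorted(range(length), key=lambda j: scores[j])[:n_mask]:
            tokens[i] = MASK
    return tokens

def toy_predict(tokens: List[str]) -> List[Tuple[str, float]]:
    """Stand-in for a real model: always proposes a fixed target and
    grows more confident as more context becomes visible."""
    target = ["bert", "meets", "ctc"]
    visible = sum(tok != MASK for tok in tokens)
    return [(target[i], 0.5 + 0.1 * visible + 0.01 * i) for i in range(len(tokens))]

collapsed = ctc_collapse(["a", "a", BLANK, "a", "b", "b", BLANK])
refined = mask_predict(3, toy_predict, iterations=3)
```

In this sketch a blank between two identical labels keeps them distinct after collapsing, and each mask-predict pass leaves fewer positions masked until the hypothesis is complete.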
