AS · CL · SD · Jul 7, 2021

Advancing CTC-CRF Based End-to-End Speech Recognition with Wordpieces and Conformers

arXiv:2107.03007v2 · 5 citations
AI Analysis

This work makes incremental improvements to speech recognition systems, evaluated on English and German datasets, with a focus on modeling efficiency and accuracy.

The paper improves CTC-CRF based end-to-end speech recognition by exploring wordpiece modeling units and Conformer neural networks. It finds that Conformer significantly boosts recognition performance, and that wordpiece-based systems perform slightly worse than phone-based systems for English but comparably for languages with high grapheme-phoneme correspondence, such as German.

Automatic speech recognition systems have been largely improved in the past few decades and current systems are mainly hybrid-based and end-to-end-based. The recently proposed CTC-CRF framework inherits the data-efficiency of the hybrid approach and the simplicity of the end-to-end approach. In this paper, we further advance CTC-CRF based ASR technique with explorations on modeling units and neural architectures. Specifically, we investigate techniques to enable the recently developed wordpiece modeling units and Conformer neural networks to be successfully applied in CTC-CRFs. Experiments are conducted on two English datasets (Switchboard, Librispeech) and a German dataset from CommonVoice. Experimental results suggest that (i) Conformer can improve the recognition performance significantly; (ii) Wordpiece-based systems perform slightly worse compared with phone-based systems for the target language with a low degree of grapheme-phoneme correspondence (e.g. English), while the two systems can perform equally strong when such degree of correspondence is high for the target language (e.g. German).

Code Implementations: 1 repo