ASCL Oct 8, 2021

Hierarchical Conditional End-to-End ASR with CTC and Multi-Granular Subword Units

arXiv:2110.04109v2
29 citations
Originality: Incremental advance
AI Analysis

This work addresses the abstraction gap between input acoustic signals and output linguistic tokens in end-to-end ASR, presenting an incremental improvement through a new training approach.

The paper tackles the challenge of learning word-level representations in end-to-end automatic speech recognition by proposing a hierarchical conditional model trained with auxiliary CTC losses on multi-granular subword units. It reports improvements over a standard CTC-based model and other competitive models from prior work on LibriSpeech-100h, LibriSpeech-960h, and TEDLIUM2.

In end-to-end automatic speech recognition (ASR), a model is expected to implicitly learn representations suitable for recognizing a word-level sequence. However, the huge abstraction gap between input acoustic signals and output linguistic tokens makes it challenging for a model to learn the representations. In this work, to promote the word-level representation learning in end-to-end ASR, we propose a hierarchical conditional model that is based on connectionist temporal classification (CTC). Our model is trained by auxiliary CTC losses applied to intermediate layers, where the vocabulary size of each target subword sequence is gradually increased as the layer becomes close to the word-level output. Here, we make each level of sequence prediction explicitly conditioned on the previous sequences predicted at lower levels. With the proposed approach, we expect the proposed model to learn the word-level representations effectively by exploiting a hierarchy of linguistic structures. Experimental results on LibriSpeech-{100h, 960h} and TEDLIUM2 demonstrate that the proposed model improves over a standard CTC-based model and other competitive models from prior work. We further analyze the results to confirm the effectiveness of the intended representation learning with our model.
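
The training setup described in the abstract can be illustrated with a small, self-contained sketch. It is not the authors' implementation. It assumes toy, hypothetical vocabularies (`LEVEL_VOCABS`), a greedy longest-match segmenter (`segment`) standing in for a real subword tokenizer, uniform frame posteriors standing in for the outputs of intermediate encoder layers, and equal weighting of the per-level losses. The conditioning of each level on lower-level predictions requires a neural network and is left out here. What the sketch does show is how one word yields progressively shorter target sequences as the subword vocabulary grows, and how a CTC loss computed at each level can be combined into one objective.

```python
import math

def segment(word, vocab):
    """Greedy longest-match segmentation; single characters are always allowed."""
    units, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in vocab or j == i + 1:
                units.append(piece)
                i = j
                break
    return units

def logsumexp(*xs):
    m = max(xs)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(x - m) for x in xs))

def ctc_nll(log_probs, target, blank=0):
    """Negative log-likelihood of `target` under CTC (forward algorithm).

    log_probs: list of T frames, each a list of log-probabilities per label.
    target: list of label ids without blanks.
    """
    ext = [blank]
    for label in target:
        ext += [label, blank]          # interleave blanks: -, y1, -, y2, -, ...
    S = len(ext)
    alpha = [-math.inf] * S
    alpha[0] = log_probs[0][blank]
    if S > 1:
        alpha[1] = log_probs[0][ext[1]]
    for frame in log_probs[1:]:
        new = [-math.inf] * S
        for s in range(S):
            terms = [alpha[s]]
            if s >= 1:
                terms.append(alpha[s - 1])
            # Skipping a blank is allowed only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                terms.append(alpha[s - 2])
            new[s] = logsumexp(*terms) + frame[ext[s]]
        alpha = new
    total = logsumexp(alpha[-1], alpha[-2]) if S > 1 else alpha[-1]
    return -total

def hierarchical_ctc_loss(level_log_probs, level_targets):
    """Equal-weight average of per-level CTC losses (the weighting is an assumption)."""
    losses = [ctc_nll(lp, tgt) for lp, tgt in zip(level_log_probs, level_targets)]
    return sum(losses) / len(losses), losses

# Hypothetical toy vocabularies of multi-character units, growing with the level.
WORD = "speech"
LEVEL_VOCABS = [set(), {"sp", "ee", "ch"}, {"sp", "ee", "ch", "speech"}]
T = 8  # toy number of acoustic frames

level_units = [segment(WORD, vocab) for vocab in LEVEL_VOCABS]
level_targets, level_log_probs = [], []
for units in level_units:
    symbols = sorted(set(units))                      # toy label set for this level
    ids = {u: k + 1 for k, u in enumerate(symbols)}   # id 0 is reserved for blank
    level_targets.append([ids[u] for u in units])
    n_labels = len(symbols) + 1
    # Placeholder posteriors: uniform over labels at every frame.
    level_log_probs.append([[math.log(1.0 / n_labels)] * n_labels for _ in range(T)])

total_loss, per_level_losses = hierarchical_ctc_loss(level_log_probs, level_targets)
for units, loss in zip(level_units, per_level_losses):
    print(units, round(loss, 3))
print("combined loss:", round(total_loss, 3))
```

In a real model, each entry of `level_log_probs` would come from a different encoder layer, with lower layers paired with the finer-grained targets and the final layer paired with the coarsest, most word-like units.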

Code Implementations: 1 repo
