CL · Jul 17, 2018

Hierarchical Multitask Learning for CTC-based Speech Recognition

arXiv:1807.06234v2 · 44 citations
Originality: Synthesis-oriented
AI Analysis

This work targets recognition accuracy on telephone conversational speech. Its contribution is incremental: it extends existing hierarchical multitask learning methods to the CTC framework.

The paper investigates hierarchical multitask learning for CTC-based speech recognition. It finds that adding an auxiliary phone loss at an intermediate layer improves telephone conversational speech recognition. The best results combine this approach with pretraining, reducing word error rate by 3.4% absolute on the Eval2000 test sets.

Previous work has shown that neural encoder-decoder speech recognition can be improved with hierarchical multitask learning, where auxiliary tasks are added at intermediate layers of a deep encoder. We explore the effect of hierarchical multitask learning in the context of connectionist temporal classification (CTC)-based speech recognition, and investigate several aspects of this approach. Consistent with previous work, we observe performance improvements on telephone conversational speech recognition (specifically the Eval2000 test sets) when training a subword-level CTC model with an auxiliary phone loss at an intermediate layer. We analyze the effects of a number of experimental variables (like interpolation constant and position of the auxiliary loss function), performance in lower-resource settings, and the relationship between pretraining and multitask learning. We observe that the hierarchical multitask approach improves over standard multitask training in our higher-data experiments, while in the low-resource settings standard multitask training works well. The best results are obtained by combining hierarchical multitask learning and pretraining, which improves word error rates by 3.4% absolute on the Eval2000 test sets.
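The training objective described in the abstract can be illustrated with a minimal sketch: a CTC loss on subword targets computed from the final encoder layer, plus an auxiliary CTC loss on phone targets computed from an intermediate layer, mixed by an interpolation constant. The abstract does not give the exact formula, so the convex combination below is an assumption, and the names `ctc_neg_log_likelihood`, `interpolated_loss`, and `lam` are hypothetical. The toy posteriors stand in for network outputs; a real system would rely on a framework's log-space CTC implementation rather than this plain-probability version.

```python
import math


def ctc_neg_log_likelihood(probs, labels, blank=0):
    """CTC negative log-likelihood via the forward algorithm.

    probs  : list of T per-frame distributions (each a list of floats)
    labels : target label sequence (list of ints, no blanks)
    """
    # Interleave blanks: [b, l1, b, l2, b, ...]
    ext = [blank]
    for label in labels:
        ext += [label, blank]
    num_states = len(ext)

    # Initialise: a path may start in the first blank or the first label.
    alpha = [0.0] * num_states
    alpha[0] = probs[0][ext[0]]
    if num_states > 1:
        alpha[1] = probs[0][ext[1]]

    for t in range(1, len(probs)):
        new_alpha = [0.0] * num_states
        for s in range(num_states):
            total = alpha[s]
            if s >= 1:
                total += alpha[s - 1]
            # Skipping a blank is allowed only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new_alpha[s] = total * probs[t][ext[s]]
        alpha = new_alpha

    # Valid paths end in the last label or the trailing blank.
    likelihood = alpha[-1] + (alpha[-2] if num_states > 1 else 0.0)
    return -math.log(likelihood) if likelihood > 0.0 else math.inf


def interpolated_loss(main_loss, aux_loss, lam):
    """Assumed form: convex combination of main and auxiliary losses."""
    return (1.0 - lam) * main_loss + lam * aux_loss


# Toy posteriors over {blank, symbol 1, symbol 2} for three frames.
# subword_probs stands in for the final layer, phone_probs for an
# intermediate layer; both are made-up values for illustration only.
subword_probs = [[0.6, 0.3, 0.1], [0.2, 0.7, 0.1], [0.7, 0.2, 0.1]]
phone_probs = [[0.5, 0.4, 0.1], [0.3, 0.6, 0.1], [0.8, 0.1, 0.1]]

main = ctc_neg_log_likelihood(subword_probs, [1])
aux = ctc_neg_log_likelihood(phone_probs, [1])
total = interpolated_loss(main, aux, lam=0.5)
```

In this sketch, setting `lam` to 0 recovers plain subword CTC training. Which encoder layer feeds the phone loss is the other experimental variable the abstract mentions.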

