CL Jul 18, 2018

Hierarchical Multi Task Learning With CTC

arXiv:1807.07104v53.553 citations

Originality: Incremental advance

AI Analysis

This work aims to improve speech recognition accuracy when training data is limited, and represents an incremental advance over existing methods.

The paper tackles the challenge of learning useful intermediate representations in automatic speech recognition when using high-level target units like words, by introducing hierarchical multi-task training with Connectionist Temporal Classification at different network levels. The result is a model that achieves a 14.0% Word Error Rate on the Eval2000 Switchboard subset without a decoder or language model, outperforming current state-of-the-art acoustic-to-word models.

In Automatic Speech Recognition, it is still challenging to learn useful intermediate representations when using high-level (or abstract) target units such as words. For that reason, character- or phoneme-based systems tend to outperform word-based systems when only a few hundred hours of training data are used. In this paper, we first show how hierarchical multi-task training can encourage the formation of useful intermediate representations. We achieve this by performing Connectionist Temporal Classification at different levels of the network with targets of different granularity. Our model thus makes predictions at multiple scales for the same input. On the standard 300h Switchboard training setup, our hierarchical multi-task architecture exhibits improvements over single-task architectures with the same number of parameters. Our model obtains 14.0% Word Error Rate on the Eval2000 Switchboard subset without any decoder or language model, outperforming the current state of the art among acoustic-to-word models.
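To make the idea of "CTC at different levels of the network with targets of different granularity" concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes each network level emits per-frame log-probabilities over its own label set (index 0 is the CTC blank) and that the per-level CTC losses are combined by a weighted sum. The function names `ctc_nll` and `hierarchical_ctc_loss`, the weighting scheme, and the toy inputs are all hypothetical; the abstract does not state how the losses are combined.

```python
import math


def logsumexp(*xs):
    """Numerically stable log(sum(exp(x))) that tolerates -inf entries."""
    m = max(xs)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(x - m) for x in xs))


def ctc_nll(log_probs, target, blank=0):
    """CTC negative log-likelihood computed with the forward algorithm.

    log_probs: list of T frames, each a list of per-label log-probabilities.
    target:    list of label ids without blanks.
    """
    # Interleave blanks: [a, b] -> [blank, a, blank, b, blank]
    ext = [blank]
    for label in target:
        ext += [label, blank]
    S, T = len(ext), len(log_probs)

    # A valid path must start in the first blank or the first label.
    alpha = [-math.inf] * S
    alpha[0] = log_probs[0][blank]
    if S > 1:
        alpha[1] = log_probs[0][ext[1]]

    for t in range(1, T):
        new = [-math.inf] * S
        for s in range(S):
            terms = [alpha[s]]                      # stay on the same symbol
            if s >= 1:
                terms.append(alpha[s - 1])          # advance by one position
            # Skipping a blank is allowed only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                terms.append(alpha[s - 2])
            new[s] = logsumexp(*terms) + log_probs[t][ext[s]]
        alpha = new

    # A valid path must end in the last label or the trailing blank.
    final = logsumexp(alpha[-1], alpha[-2]) if S > 1 else alpha[-1]
    return -final


def hierarchical_ctc_loss(level_log_probs, level_targets, weights=None):
    """Weighted sum of CTC losses, one per network level (assumed combination)."""
    if weights is None:
        weights = [1.0 / len(level_log_probs)] * len(level_log_probs)
    return sum(
        w * ctc_nll(lp, tgt)
        for w, lp, tgt in zip(weights, level_log_probs, level_targets)
    )


# Toy example: the same 4-frame input scored at two granularities.
uniform = lambda n, frames: [[math.log(1.0 / n)] * n for _ in range(frames)]
fine_level = uniform(4, 4)      # hypothetical intermediate layer, 3 small units + blank
coarse_level = uniform(3, 4)    # hypothetical final layer, 2 word units + blank
loss = hierarchical_ctc_loss(
    [fine_level, coarse_level],
    [[1, 2, 3], [1]],           # fine-grained target is longer than the word target
)
print(f"combined loss: {loss:.4f}")
```

In a real system the two sets of log-probabilities would come from different depths of one shared acoustic encoder, so the fine-grained loss shapes the lower layers while the word-level loss is applied at the top. The uniform distributions here only stand in for those outputs.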
