CL, Nov 12, 2018

Multi-encoder multi-resolution framework for end-to-end speech recognition

Ruizhi Li, Xiaofei Wang, Sri Harish Mallidi, Takaaki Hori, Shinji Watanabe, Hynek Hermansky

arXiv:1811.04897v11.911 citations

Originality: Incremental advance

AI Analysis

This work addresses speech recognition accuracy for applications like transcription, showing incremental improvements over existing joint CTC/Attention models.

The paper aims to improve end-to-end automatic speech recognition by proposing a Multi-Encoder Multi-Resolution (MEMR) framework built on the joint CTC/Attention model. On WSJ and CHiME-4 it reports relative Word Error Rate (WER) reductions of 18.0-32.1%, and it achieves 3.6% WER on the WSJ eval92 test set, the best WER reported for an end-to-end system on this benchmark.
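For readers unfamiliar with the metric, a relative WER reduction compares the drop in WER against the baseline WER rather than quoting the absolute difference. The sketch below shows only the arithmetic; the function name and the input numbers are hypothetical illustrations and are not taken from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction in percent.

    Computed as 100 * (baseline - new) / baseline, with both WERs
    given in percent.
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical WER values chosen for illustration only (not from the paper).
example = relative_wer_reduction(baseline_wer=10.0, new_wer=8.2)
print(f"relative reduction: {example:.1f}%")
```

Under these made-up inputs, an absolute drop of 1.8 points from a 10.0% baseline corresponds to a relative reduction of about 18%.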

Attention-based methods and Connectionist Temporal Classification (CTC) networks have been promising research directions for end-to-end Automatic Speech Recognition (ASR). The joint CTC/Attention model has achieved great success by utilizing both architectures during multi-task training and joint decoding. In this work, we present a novel Multi-Encoder Multi-Resolution (MEMR) framework based on the joint CTC/Attention model. Two heterogeneous encoders with different architectures, temporal resolutions and separate CTC networks work in parallel to extract complementary acoustic information. A hierarchical attention mechanism is then used to combine the encoder-level information. To demonstrate the effectiveness of the proposed model, experiments are conducted on Wall Street Journal (WSJ) and CHiME-4, resulting in relative Word Error Rate (WER) reductions of 18.0-32.1%. Moreover, the proposed MEMR model achieves 3.6% WER on the WSJ eval92 test set, which is the best WER reported for an end-to-end system on this benchmark.
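The hierarchical attention idea in the abstract can be illustrated with a minimal sketch: each encoder gets its own frame-level attention, and a second, stream-level attention then weights the resulting per-encoder context vectors. This is not the paper's implementation. The dot-product scoring, the function names (`attend`, `hierarchical_attention`), and the toy inputs are all assumptions made for illustration; the actual model may use a different attention scoring function and learned projections.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def attend(query, frames):
    """Frame-level attention: weight one encoder's frames by similarity
    to the decoder query and return the context vector.
    Dot-product scoring is an assumption of this sketch."""
    weights = softmax([dot(query, frame) for frame in frames])
    dim = len(frames[0])
    return [sum(w * frame[d] for w, frame in zip(weights, frames))
            for d in range(dim)]


def hierarchical_attention(query, streams):
    """Stream-level attention over per-encoder context vectors.

    `streams` is a list of encoder outputs; each may have a different
    number of frames (different temporal resolution), but all frames
    share the same feature dimension.
    Returns the fused context vector and the stream weights."""
    contexts = [attend(query, frames) for frames in streams]
    stream_weights = softmax([dot(query, c) for c in contexts])
    dim = len(contexts[0])
    fused = [sum(w * c[d] for w, c in zip(stream_weights, contexts))
             for d in range(dim)]
    return fused, stream_weights


# Toy inputs (hypothetical): encoder A keeps 4 frames, encoder B is
# subsampled to 2 frames, mimicking two temporal resolutions.
query = [1.0, 0.0]
encoder_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
encoder_b = [[1.0, 0.0], [0.0, 1.0]]
fused, weights = hierarchical_attention(query, [encoder_a, encoder_b])
print("stream weights:", weights)
print("fused context:", fused)
```

Because the stream-level weights are recomputed for every decoder query, a model of this shape can lean on one encoder more than the other at different output steps, which is one way the encoder-level information described above could be combined.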

