AS SD

Jun 18, 2020

Multi-Encoder-Decoder Transformer for Code-Switching Speech Recognition

Xinyuan Zhou, Emre Yılmaz, Yanhua Long, Yijie Li, Haizhou Li

arXiv:2006.10414v1

15.255 citations

Originality: Incremental advance

AI Analysis

This paper addresses the recognition of mixed-language (code-switching) speech in ASR systems. It is an incremental improvement, with relative error rate reductions of roughly 10% on its evaluation sets.

The paper tackles automatic speech recognition of code-switching speech, in which speakers alternate between languages. It proposes a multi-encoder-decoder Transformer architecture with language-specific encoders and language-specific attention mechanisms in the decoder, and reports relative error rate reductions of 10.2% and 10.8% on the two evaluation sets.
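
For readers unfamiliar with the metric, a relative error rate reduction is conventionally measured against the baseline error rate rather than as an absolute difference. The symbols below are assumed names for illustration, not notation taken from the paper:

```latex
\text{relative reduction} = \frac{E_{\text{base}} - E_{\text{MED}}}{E_{\text{base}}} \times 100\%
```

Here $E_{\text{base}}$ would be the error rate of the baseline system and $E_{\text{MED}}$ that of the proposed model on the same evaluation set.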

Code-switching (CS) occurs when a speaker alternates between words of two or more languages within a single sentence or across sentences. Automatic speech recognition (ASR) of CS speech has to deal with two or more languages at the same time. In this study, we propose a Transformer-based architecture with two symmetric language-specific encoders that capture the individual language attributes and thereby improve the acoustic representation of each language. These representations are combined using a language-specific multi-head attention mechanism in the decoder module. Each encoder and its corresponding attention module in the decoder are pre-trained on a large monolingual corpus to alleviate the impact of limited CS training data. We call such a network a multi-encoder-decoder (MED) architecture. Experiments on the SEAME corpus show that the proposed MED architecture achieves 10.2% and 10.8% relative error rate reduction on the CS evaluation sets with Mandarin and English as the matrix language, respectively.
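
The decoder-side combination described in the abstract can be illustrated with a minimal sketch in plain Python. It is heavily simplified and is not the authors' implementation: it uses single-head attention without learned projections, and it assumes the two language-specific context vectors are merged by element-wise averaging, which the abstract does not specify. All function and variable names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, memory):
    """Scaled dot-product attention of one query vector over encoder frames."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, frame)) / scale for frame in memory]
    weights = softmax(scores)
    dim = len(memory[0])
    # Weighted sum of the encoder frames (frames act as both keys and values).
    return [sum(w * frame[i] for w, frame in zip(weights, memory)) for i in range(dim)]


def med_cross_attention(query, mandarin_memory, english_memory):
    """Attend to each language-specific encoder output, then combine the results."""
    ctx_mandarin = attend(query, mandarin_memory)
    ctx_english = attend(query, english_memory)
    # Assumption: element-wise average; the abstract only says "combined".
    return [(a + b) / 2 for a, b in zip(ctx_mandarin, ctx_english)]


# Toy inputs: one decoder query and two frames from each (hypothetical) encoder.
query = [1.0, 0.0]
mandarin_memory = [[1.0, 0.0], [0.0, 1.0]]
english_memory = [[0.5, 0.5], [0.5, 0.5]]
context = med_cross_attention(query, mandarin_memory, english_memory)
```

In a full model, each attention module would have its own learned multi-head projections, which is what would allow an encoder and its matching attention module to be pre-trained together on monolingual data.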

