ASSD
Jun 18, 2020

Multi-Encoder-Decoder Transformer for Code-Switching Speech Recognition

arXiv:2006.10414v1
55 citations
Originality: Incremental advance
AI Analysis

This work addresses the challenge of recognizing mixed-language (code-switching) speech in ASR systems. It is an incremental improvement with measurable gains on the reported evaluation sets.

The paper tackles automatic speech recognition for code-switching speech, in which speakers alternate between languages. It proposes a multi-encoder-decoder Transformer architecture with language-specific encoders and attention mechanisms, and reports relative error rate reductions of 10.2% and 10.8% on the SEAME code-switching evaluation sets where Mandarin and English, respectively, are the matrix language.
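The reported gains are relative, not absolute, reductions. As a minimal illustration of that arithmetic, the sketch below computes a relative error rate reduction. The function name `relative_reduction` and the error rates used are hypothetical and are not taken from the paper.

```python
def relative_reduction(baseline_error, new_error):
    """Relative error rate reduction, as a percentage of the baseline error."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return 100.0 * (baseline_error - new_error) / baseline_error


# Hypothetical error rates in percent (NOT the paper's numbers).
baseline = 20.0
improved = 18.0
reduction = relative_reduction(baseline, improved)
print(f"relative reduction: {reduction:.1f}%")
```

With these made-up values, an absolute drop of 2 points from a 20% baseline is a 10% relative reduction, which is why a relative figure should always be read alongside the baseline it refers to.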

Code-switching (CS) occurs when a speaker alternates words of two or more languages within a single sentence or across sentences. Automatic speech recognition (ASR) of CS speech has to deal with two or more languages at the same time. In this study, we propose a Transformer-based architecture with two symmetric language-specific encoders that capture the individual language attributes and thereby improve the acoustic representation of each language. These representations are combined using a language-specific multi-head attention mechanism in the decoder module. Each encoder and its corresponding attention module in the decoder are pre-trained on a large monolingual corpus to alleviate the impact of limited CS training data. We call such a network a multi-encoder-decoder (MED) architecture. Experiments on the SEAME corpus show that the proposed MED architecture achieves 10.2% and 10.8% relative error rate reductions on the CS evaluation sets with Mandarin and English as the matrix language, respectively.
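The combination step in the decoder can be pictured with a toy sketch: one decoder query attends separately to the output of each language-specific encoder, and the two resulting context vectors are merged. This is only an illustration under stated assumptions. It uses a single head with no learned projections, and it merges the two contexts by simple averaging, which is an assumption here and not something the abstract specifies. All function and variable names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, memory):
    """Scaled dot-product attention of one query vector over encoder frames."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, frame)) / scale for frame in memory]
    weights = softmax(scores)
    dim = len(memory[0])
    # Weighted sum of encoder frames (frames serve as both keys and values).
    return [sum(w * frame[d] for w, frame in zip(weights, memory)) for d in range(dim)]


def combine_language_contexts(query, mandarin_memory, english_memory):
    """Attend to each language-specific encoder output, then average (assumed merge rule)."""
    ctx_zh = attend(query, mandarin_memory)
    ctx_en = attend(query, english_memory)
    return [(a + b) / 2.0 for a, b in zip(ctx_zh, ctx_en)]


# Toy inputs: one 2-dimensional decoder query and two frames per encoder.
query = [1.0, 0.0]
mandarin_memory = [[1.0, 0.0], [0.0, 1.0]]
english_memory = [[0.5, 0.5], [0.5, 0.5]]
context = combine_language_contexts(query, mandarin_memory, english_memory)
print(context)
```

In a real Transformer decoder each attention module would have its own learned query, key, and value projections and multiple heads; keeping the two modules separate is what allows each one to be pre-trained together with its encoder on monolingual data, as the abstract describes.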

