CL, SD, AS | Aug 2, 2021

Decoupling recognition and transcription in Mandarin ASR

arXiv:2108.01129v1 | 14 citations
Originality: Incremental advance
AI Analysis

This paper addresses efficient Mandarin ASR for speech technology applications. It is an incremental advance with one specific, measured gain.

The paper tackles Mandarin automatic speech recognition by decoupling recognition from transcription: an audio-to-Pinyin task followed by a Pinyin-to-Hanzi task. It reports a 3.9% character error rate (CER) on the Aishell-1 corpus, the best result reported on that dataset at the time of publication.

Much of the recent literature on automatic speech recognition (ASR) is taking an end-to-end approach. Unlike English where the writing system is closely related to sound, Chinese characters (Hanzi) represent meaning, not sound. We propose factoring audio -> Hanzi into two sub-tasks: (1) audio -> Pinyin and (2) Pinyin -> Hanzi, where Pinyin is a system of phonetic transcription of standard Chinese. Factoring the audio -> Hanzi task in this way achieves 3.9% CER (character error rate) on the Aishell-1 corpus, the best result reported on this dataset so far.
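
For readers unfamiliar with the metric, CER is conventionally the character-level edit distance between the reference and the hypothesis, divided by the number of reference characters. The following is a minimal sketch of that standard definition, not the paper's scoring script; the function names `edit_distance` and `cer` and the example sentences are illustrative assumptions.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            deletion = prev[j] + 1       # drop a reference character
            insertion = curr[j - 1] + 1  # extra hypothesis character
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical example: one wrong character out of six.
reference = "今天天气很好"
hypothesis = "今天天汽很好"
print(f"CER = {cer(reference, hypothesis):.3f}")
```

Under this definition, a 3.9% CER corresponds to roughly 4 character edits per 100 reference characters. Note that CER can exceed 100% when the hypothesis contains many insertions.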

