CL, LG, SD, Jun 15, 2016

Automatic Pronunciation Generation by Utilizing a Semi-supervised Deep Neural Networks

arXiv:1606.05007v1, 2 citations
Originality: Incremental advance
AI Analysis

This paper addresses the challenges of creating pronunciation dictionaries for ASR systems, though it appears incremental because it builds on existing semi-supervised DNN approaches.

The paper tackles the problem of suboptimal phonemic units in ASR by proposing a data-driven method that jointly estimates sub-word units and a pronunciation dictionary from orthographic transcriptions alone. The authors report that it largely outperforms phoneme-based continuous speech recognition on the TIMIT dataset.

Phonemic or phonetic sub-word units are the most commonly used atomic elements for representing speech signals in modern ASR systems. However, they are not the optimal choice for several reasons, such as the large amount of effort required to handcraft a pronunciation dictionary, pronunciation variations, human mistakes, and under-resourced dialects and languages. Here, we propose a data-driven pronunciation estimation and acoustic modeling method that takes only the orthographic transcription to jointly estimate a set of sub-word units and a reliable dictionary. Experimental results show that the proposed method, which is based on semi-supervised training of a deep neural network, largely outperforms phoneme-based continuous speech recognition on the TIMIT dataset.

