CL Jan 17, 2023

Syllable Subword Tokens for Open Vocabulary Speech Recognition in Malayalam

arXiv:2301.06736v1222 citations, h-index: 16
Originality: Synthesis-oriented
AI Analysis

This paper addresses out-of-vocabulary words in Malayalam ASR, but the contribution is incremental: it applies an existing subword approach, with a specific token type, to a new language.

The paper tackles open vocabulary speech recognition in Malayalam, a morphologically complex language with a very large vocabulary, by using syllable subword tokens instead of words, resulting in reduced lexicon size, lower model memory requirements, and improved word error rate.

In a hybrid automatic speech recognition (ASR) system, a pronunciation lexicon (PL) and a language model (LM) are essential to correctly retrieve spoken word sequences. Because Malayalam is a morphologically complex language, its vocabulary is so large that it is impossible to build a PL and an LM that cover all of its diverse word forms. Using subword tokens to build the PL and LM, and combining them to form words after decoding, enables the recovery of many out-of-vocabulary words. In this work, we investigate the impact of using syllables as subword tokens instead of words in Malayalam ASR, and evaluate the relative improvement in lexicon size, model memory requirement, and word error rate.
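
The step of combining subword tokens back into words after decoding can be illustrated with a minimal sketch. It assumes a convention that the abstract does not specify: every non-final syllable of a word carries a trailing `+` continuation marker. The function name `join_subwords`, the marker, and the romanized syllable split are all hypothetical and chosen only for illustration; they are not taken from the paper or its code.

```python
def join_subwords(tokens, marker="+"):
    """Merge decoded subword tokens into words.

    Assumption (not from the paper): a token that ends with `marker`
    continues into the next token; a token without it ends a word.
    """
    if not marker:
        raise ValueError("marker must be a non-empty string")

    words = []
    current = ""
    for tok in tokens:
        if tok.endswith(marker):
            # Word-internal syllable: strip the marker and keep building.
            current += tok[: -len(marker)]
        else:
            # Word-final syllable: close the word.
            words.append(current + tok)
            current = ""
    if current:
        # Decoder output ended on a continuation token; keep what we have.
        words.append(current)
    return words


# Illustrative, romanized syllables (hypothetical segmentation).
decoded = ["ma+", "la+", "yaa+", "lam", "bhaa+", "sha"]
print(join_subwords(decoded))
```

Because words are rebuilt from syllables, a word form that never appeared as a whole in the lexicon can still be produced, as long as each of its syllables is in the subword lexicon.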

Code Implementations: 1 repo
