AS · CL · LG · SD · May 31, 2021

Low-Resource Spoken Language Identification Using Self-Attentive Pooling and Deep 1D Time-Channel Separable Convolutions

arXiv:2106.00052v1 · 5 citations
Originality: Incremental advance
AI Analysis

This work addresses spoken language identification for multilingual automated speech recognition systems in low-resource settings, a situation that applies to most of the languages of Russia. It is an incremental advance: it builds on existing methods with specific architectural improvements.

The authors tackle low-resource spoken language identification by proposing a convolutional neural network with a self-attentive pooling layer, achieving state-of-the-art results on the Low Resource ASR challenge dataset. They also hypothesize that, when a dataset is sufficiently diverse, the confusion matrix of a language identification system reflects a measure of language similarity.
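To make the pooling idea concrete, here is a minimal sketch of one common formulation of self-attentive pooling, written in dependency-free Python. It is an illustration under stated assumptions, not the authors' implementation: the summary above does not give the exact layer definition, so the scoring function `v · tanh(W h_t + b)` and the parameter names `W`, `b`, and `v` are assumed here. The layer turns a variable-length sequence of frame-level feature vectors into one fixed-size utterance vector by taking a softmax-weighted average over time.

```python
import math


def self_attentive_pooling(frames, W, b, v):
    """Pool T frame vectors (each of size D) into a single D-dim vector.

    Assumed formulation (not taken from the paper):
        score_t  = v . tanh(W h_t + b)
        weight_t = softmax over t of score_t
        pooled   = sum_t weight_t * h_t
    frames: T x D, W: A x D, b: length A, v: length A.
    """
    scores = []
    for h in frames:
        hidden = [
            math.tanh(sum(w_ij * h_j for w_ij, h_j in zip(row, h)) + b_i)
            for row, b_i in zip(W, b)
        ]
        scores.append(sum(v_i * u_i for v_i, u_i in zip(v, hidden)))

    # Numerically stable softmax over the time axis.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Attention-weighted average of the frames.
    dim = len(frames[0])
    pooled = [sum(a * h[d] for a, h in zip(weights, frames)) for d in range(dim)]
    return pooled, weights


# Toy input: three 2-dimensional frames and a 1-unit attention projection.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled, weights = self_attentive_pooling(frames, W=[[1.0, 0.0]], b=[0.0], v=[1.0])
```

The output size depends only on the feature dimension, not on the number of frames, which is what lets a classifier head sit on top of utterances of different lengths. In a trained network `W`, `b`, and `v` would be learned parameters; the fixed values above are placeholders.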

This memo describes the NTR/TSU winning submission to the language identification track of the Low Resource ASR challenge at the Dialog2021 conference. Spoken Language Identification (LID) is an important step in a multilingual Automated Speech Recognition (ASR) system pipeline. Traditionally, the ASR task requires large volumes of labeled data that are unattainable for most of the world's languages, including most of the languages of Russia. In this memo, we show that a convolutional neural network with a Self-Attentive Pooling layer shows promising results on the language identification task in a low-resource setting and sets the state of the art (SOTA) for the Low Resource ASR challenge dataset. Additionally, we compare the structure of the confusion matrices for this dataset and for the significantly more diverse VoxForge dataset. We state and substantiate the hypothesis that whenever a dataset is diverse enough that other classification factors, such as gender and age, are well averaged, the confusion matrix of a LID system carries a language similarity measure.
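One way to read a confusion matrix as a similarity measure is sketched below in dependency-free Python. This is an assumption-laden illustration rather than the procedure used in the memo: the row normalisation, the symmetrisation by averaging, the function names, and the labels `lang_a`, `lang_b`, `lang_c` are all hypothetical, and the counts are made-up toy numbers, not results from the challenge or VoxForge datasets.

```python
def confusion_to_similarity(confusion):
    """Turn a square matrix of counts into a symmetric similarity matrix.

    confusion[i][j] is the number of utterances of language i predicted as j.
    Assumed recipe: row-normalise to confusion rates, then average the two
    directions so that sim[i][j] == sim[j][i]. The diagonal is set to 1.0.
    """
    n = len(confusion)
    rates = []
    for row in confusion:
        total = sum(row)
        rates.append([c / total if total else 0.0 for c in row])
    return [
        [1.0 if i == j else 0.5 * (rates[i][j] + rates[j][i]) for j in range(n)]
        for i in range(n)
    ]


def most_confusable_pair(similarity, labels):
    """Return the pair of distinct labels with the highest similarity."""
    n = len(labels)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    i, j = max(pairs, key=lambda p: similarity[p[0]][p[1]])
    return labels[i], labels[j]


# Toy counts for three hypothetical languages (illustrative only).
labels = ["lang_a", "lang_b", "lang_c"]
counts = [
    [80, 15, 5],
    [10, 85, 5],
    [2, 3, 95],
]
similarity = confusion_to_similarity(counts)
closest = most_confusable_pair(similarity, labels)
```

With these toy counts, `lang_a` and `lang_b` are confused with each other most often, so they come out as the most similar pair. Per the hypothesis above, such a reading is only meaningful when the dataset is diverse enough that speaker factors such as gender and age average out; otherwise the off-diagonal mass may reflect speakers rather than languages.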

