ASCLLGSDMLApr 2, 2020

Towards Relevance and Sequence Modeling in Language Recognition

arXiv:2004.01221v115 citations
Originality Incremental advance
AI Analysis

This addresses the problem of robust language recognition for speech processing systems in noisy environments, representing an incremental advance over prior sequence-based models.

The paper tackles the challenge of automatic language identification in noisy, multi-dialect scenarios by proposing a neural network framework that uses short-sequence information and relevance weighting via BLSTM with attention. It achieves significant improvements over conventional methods on noisy NIST LRE 2017 and RATS datasets.

The task of automatic language identification (LID) involving multiple dialects of the same language family in the presence of noise is a challenging problem. In these scenarios, the identity of the language/dialect may be reliably present only in parts of the temporal sequence of the speech signal. The conventional approaches to LID (and for speaker recognition) ignore the sequence information by extracting long-term statistical summary of the recording assuming an independence of the feature frames. In this paper, we propose a neural network framework utilizing short-sequence information in language recognition. In particular, a new model is proposed for incorporating relevance in language recognition, where parts of speech data are weighted more based on their relevance for the language recognition task. This relevance weighting is achieved using the bidirectional long short-term memory (BLSTM) network with attention modeling. We explore two approaches, the first approach uses segment level i-vector/x-vector representations that are aggregated in the neural model and the second approach where the acoustic features are directly modeled in an end-to-end neural model. Experiments are performed using the language recognition task in NIST LRE 2017 Challenge using clean, noisy and multi-speaker speech data as well as in the RATS language recognition corpus. In these experiments on noisy LRE tasks as well as the RATS dataset, the proposed approach yields significant improvements over the conventional i-vector/x-vector based language recognition approaches as well as with other previous models incorporating sequence information.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes