AS, LG, SD · Feb 20, 2019

Utterance-level end-to-end language identification using attention-based CNN-BLSTM

arXiv:1902.07374v1 · 51 citations
Originality: Incremental advance
AI Analysis

This work addresses language identification for speech processing applications, presenting an incremental improvement over existing neural network approaches.

The paper tackles language identification from speech by proposing an end-to-end attention-based CNN-BLSTM model that processes variable-length utterances directly, achieving error reduction comparable to state-of-the-art methods on the NIST LRE07 tasks for 3-, 10-, and 30-second durations.

In this paper, we present an end-to-end language identification framework, the attention-based Convolutional Neural Network-Bidirectional Long Short-Term Memory (CNN-BLSTM). The model operates at the utterance level, which means the utterance-level decision can be obtained directly from the output of the neural network. To handle speech utterances of arbitrary and potentially long duration, we combine the CNN-BLSTM model with a self-attentive pooling layer. The front-end CNN-BLSTM module acts as a local pattern extractor for the variable-length inputs, and the self-attentive pooling layer built on top of it produces the fixed-dimensional utterance-level representation. We conducted experiments on the NIST LRE07 closed-set task, and the results show that the proposed attention-based CNN-BLSTM model achieves error reduction comparable to other state-of-the-art utterance-level neural network approaches on all of the 3-second, 10-second, and 30-second duration tasks.
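The pooling step described in the abstract can be illustrated with a minimal NumPy sketch. It assumes one common formulation of self-attentive pooling (a tanh projection of each frame, scored against a learned context vector, then a softmax over time); the abstract does not give the exact equations, so the parameter names `W`, `b`, and `u`, the dimensions, and the random inputs standing in for CNN-BLSTM outputs are all hypothetical.

```python
import numpy as np


def self_attentive_pooling(frames, W, b, u):
    """Pool a (T, D) sequence of frame features into a single D-dim vector.

    frames: (T, D) array of frame-level features; T may vary per utterance.
    W (D, A), b (A,), u (A,): hypothetical learnable attention parameters.
    Returns the pooled vector (D,) and the attention weights (T,).
    """
    hidden = np.tanh(frames @ W + b)      # (T, A) projected frames
    scores = hidden @ u                   # (T,) one relevance score per frame
    scores = scores - scores.max()        # shift for numerical stability
    exp_scores = np.exp(scores)
    weights = exp_scores / exp_scores.sum()   # softmax over the time axis
    pooled = weights @ frames             # (D,) weighted average of frames
    return pooled, weights


# Random stand-ins for front-end outputs of two utterances of different length.
rng = np.random.default_rng(0)
D, A = 8, 4                               # assumed feature and attention sizes
W = rng.standard_normal((D, A))
b = np.zeros(A)
u = rng.standard_normal(A)

short_utt = rng.standard_normal((30, D))    # e.g. a short utterance
long_utt = rng.standard_normal((300, D))    # e.g. a long utterance

pooled_short, w_short = self_attentive_pooling(short_utt, W, b, u)
pooled_long, w_long = self_attentive_pooling(long_utt, W, b, u)
print(pooled_short.shape, pooled_long.shape)
```

Both utterances map to a vector of the same size regardless of their length, which is what lets a fixed classifier sit on top of variable-length input. With `u` set to zeros the weights become uniform and the layer reduces to plain average pooling over time.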
