AS · CL · LG · ML · Feb 24, 2022

Attentive Temporal Pooling for Conformer-based Streaming Language Identification in Long-form Speech

arXiv:2202.12163v4 · 15 citations
Originality: Incremental advance
AI Analysis

This work addresses language identification for long-form audio, enabling streaming applications, but it is incremental as it builds on existing conformer architectures.

The paper tackles language identification in long-form speech by proposing a conformer-based system with attentive temporal pooling for streaming inference and domain adaptation methods, achieving significant performance improvements over LSTM and transformer models.

In this paper, we introduce a novel language identification system based on conformer layers. We propose an attentive temporal pooling mechanism to allow the model to carry information in long-form audio via a recurrent form, such that the inference can be performed in a streaming fashion. Additionally, we investigate two domain adaptation approaches to allow adapting an existing language identification model without retraining the model parameters for a new domain. We perform a comparative study of different model topologies under different constraints of model size, and find that conformer-based models significantly outperform LSTM and transformer based models. Our experiments also show that attentive temporal pooling and domain adaptation improve model accuracy.
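The abstract does not spell out the pooling equations, so the following is only a minimal sketch of one generic way an attention-weighted temporal pool can be written in recurrent form for streaming inference. It is not the authors' implementation. It assumes a softmax-weighted mean of frame embeddings, with a fixed scoring vector `query` standing in for whatever learned scoring the real model uses; the names `StreamingAttentivePooling`, `batch_attentive_pooling`, and `query` are all hypothetical. The point it illustrates is that only a few running accumulators need to be carried between frames, so the pooled vector after each new frame matches what a full pass over all frames seen so far would give.

```python
import math
from dataclasses import dataclass, field
from typing import List, Sequence


@dataclass
class StreamingAttentivePooling:
    """Running softmax-weighted mean of frame embeddings (illustrative only)."""

    query: Sequence[float]  # hypothetical scoring vector, assumed for this sketch
    max_score: float = -math.inf  # running max of scores, for numerical stability
    weight_sum: float = 0.0  # running sum of exp(score - max_score)
    weighted_sum: List[float] = field(default_factory=list)  # running weighted frame sum

    def update(self, frame: Sequence[float]) -> List[float]:
        """Consume one frame and return the pooled vector over all frames so far."""
        score = sum(q * x for q, x in zip(self.query, frame))
        new_max = max(self.max_score, score)
        # Rescale the old accumulators so they are expressed relative to new_max.
        scale = math.exp(self.max_score - new_max) if self.weight_sum > 0 else 0.0
        weight = math.exp(score - new_max)
        if not self.weighted_sum:
            self.weighted_sum = [0.0] * len(frame)
        self.weighted_sum = [
            scale * s + weight * x for s, x in zip(self.weighted_sum, frame)
        ]
        self.weight_sum = scale * self.weight_sum + weight
        self.max_score = new_max
        return [s / self.weight_sum for s in self.weighted_sum]


def batch_attentive_pooling(
    query: Sequence[float], frames: Sequence[Sequence[float]]
) -> List[float]:
    """Non-streaming reference: softmax-weighted mean over all frames at once."""
    scores = [sum(q * x for q, x in zip(query, f)) for f in frames]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(frames[0])
    return [sum(w * f[i] for w, f in zip(weights, frames)) / total for i in range(dim)]


# Usage: feed frames one at a time and compare with the batch computation.
frames = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
query = [0.5, -0.25]
pool = StreamingAttentivePooling(query=query)
for frame in frames:
    pooled = pool.update(frame)
reference = batch_attentive_pooling(query, frames)
```

In this sketch the state carried across frames is constant in size (one scalar maximum, one scalar weight sum, and one vector), which is what makes long-form streaming practical: memory does not grow with audio length.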

Code Implementations: 1 repo
