CL · Dec 11, 2024

Bilevel Joint Unsupervised and Supervised Training for Automatic Speech Recognition

arXiv:2412.08548v1 · 2 citations · h-index: 40 · IEEE Transactions on Audio, Speech, and Language Processing
Originality: Incremental advance
AI Analysis

The paper addresses the need for more efficient acoustic model training in speech recognition, though it appears incremental because it builds on existing semi-supervised methods.

The paper tackles the disconnect between the two stages of conventional pre-training and fine-tuning in automatic speech recognition. It proposes a bilevel joint unsupervised and supervised training (BL-JUST) framework, which the authors report can outperform pre-training and fine-tuning, as well as some other popular semi-supervised techniques, on various datasets.

In this paper, we propose a bilevel joint unsupervised and supervised training (BL-JUST) framework for automatic speech recognition. Compared to the conventional pre-training and fine-tuning strategy which is a disconnected two-stage process, BL-JUST tries to optimize an acoustic model such that it simultaneously minimizes both the unsupervised and supervised loss functions. Because BL-JUST seeks matched local optima of both loss functions, acoustic representations learned by the acoustic model strike a good balance between being generic and task-specific. We solve the BL-JUST problem using penalty-based bilevel gradient descent and evaluate the trained deep neural network acoustic models on various datasets with a variety of architectures and loss functions. We show that BL-JUST can outperform the widely-used pre-training and fine-tuning strategy and some other popular semi-supervised techniques.
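
The penalty-based idea mentioned in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not give the exact objective, so the code assumes one common penalty form, in which the unsupervised (lower-level) loss is added to the supervised (upper-level) loss with a weight `gamma` that grows during training. The two quadratic losses, the function names, the step size, and the `gamma` schedule are all hypothetical stand-ins chosen so the result can be checked by hand.

```python
# Toy sketch of penalty-based bilevel gradient descent (illustrative only).
# Assumption: the bilevel problem
#     min_theta  sup_loss(theta)   s.t.  theta in argmin unsup_loss
# is relaxed to the single-level problem
#     min_theta  sup_loss(theta) + gamma * unsup_loss(theta)
# with gamma increased over time. All names here are hypothetical.

def sup_loss(theta):
    # Stand-in for a supervised loss; its unconstrained minimum is (3, 1).
    return (theta[0] - 3.0) ** 2 + (theta[1] - 1.0) ** 2

def unsup_loss(theta):
    # Stand-in for an unsupervised loss; it is minimized on the whole
    # line theta[0] + theta[1] = 2, so it has many equally good minimizers.
    return (theta[0] + theta[1] - 2.0) ** 2

def grad_sup(theta):
    return [2.0 * (theta[0] - 3.0), 2.0 * (theta[1] - 1.0)]

def grad_unsup(theta):
    r = 2.0 * (theta[0] + theta[1] - 2.0)
    return [r, r]

def penalty_bilevel_descent(steps=5000, lr=0.004, gamma_max=50.0):
    theta = [0.0, 0.0]
    for k in range(steps):
        # Ramp the penalty weight linearly over the first half, then hold it.
        gamma = gamma_max * min(1.0, k / (steps / 2))
        gs, gu = grad_sup(theta), grad_unsup(theta)
        # One gradient step on the penalized objective.
        theta = [t - lr * (a + gamma * b) for t, a, b in zip(theta, gs, gu)]
    return theta

if __name__ == "__main__":
    theta = penalty_bilevel_descent()
    print(theta, sup_loss(theta), unsup_loss(theta))
```

In this toy problem, the iterate moves toward the point on the line of unsupervised minimizers that is closest to the supervised optimum, roughly (2, 0), rather than settling at the supervised optimum (3, 1). That is the sense in which a penalized solution can satisfy both losses at once, in contrast to a two-stage procedure where the second stage is free to drift away from the first stage's solution.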
