CV | May 22, 2025

An Effective Training Framework for Light-Weight Automatic Speech Recognition Models

arXiv:2505.16991v21 citations | h-index: 21 | INTERSPEECH
Originality Highly original
AI Analysis

This work addresses the practical challenge of deploying efficient ASR models on low-resource devices, offering a novel method that reduces performance degradation compared to existing approaches.

The paper tackles the problem of deploying large automatic speech recognition models on low-resource devices by introducing a two-step representation learning approach that produces small models from a single large model, achieving a three-fold training speed-up and up to 12.54% word error rate improvement.

Recent advances in deep learning have encouraged the development of large automatic speech recognition (ASR) models that achieve promising results while ignoring computational and memory constraints. However, deploying such models on low-resource devices is impractical despite their favorable performance. Existing approaches (pruning, distillation, layer skipping, etc.) transform large models into smaller ones at the cost of significant performance degradation, or require prolonged training of the smaller models to recover performance. To address these issues, we introduce an effective two-step, representation-learning-based approach that can produce several small models from a single large model while ensuring considerably better performance in a limited number of epochs. Comprehensive experiments on ASR benchmarks demonstrate the efficacy of our approach, which achieves a three-fold training speed-up and up to 12.54% word error rate improvement.
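
The abstract reports its gains as a word error rate (WER) improvement without defining the metric here. As a minimal sketch, assuming the standard definition of WER as word-level edit distance divided by the number of reference words, the following Python shows how WER and a relative improvement between two models might be computed. The function names are hypothetical and are not taken from the paper, and the source does not state whether the 12.54% figure is relative or absolute.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (rolling-row Levenshtein distance).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_improvement(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer is lower than baseline_wer (assumed relative measure)."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer
```

For instance, under this assumed relative measure, a small model that lowers WER from 0.10 to 0.08 would show a 20% improvement, whereas the absolute difference would be 2 percentage points.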

