SD | CV | AS | Jul 10, 2025

Input Conditioned Layer Dropping in Speech Foundation Models

arXiv:2507.07954v1 | 1 citation | h-index: 21 | MLSP
Originality: Incremental advance
AI Analysis

This work addresses the need for dynamic, computationally efficient speech models in resource-constrained environments, representing an incremental improvement over existing layer dropping techniques.

The paper tackles the problem of adapting speech foundation models to edge and IoT settings with varying computational resources by proposing an input-driven layer dropping method, which outperformed random dropping and achieved results on par with or better than early exit across 4 speech and audio benchmarks.

Curating foundation speech models for edge and IoT settings, where computational resources vary over time, requires dynamic architectures featuring adaptable reduction strategies. One emerging approach is layer dropping ($\mathcal{LD}$), which skips a fraction of the layers of a backbone network during inference to reduce the computational load. This allows transforming static models into dynamic ones. However, existing approaches exhibit limitations, either in the way they select layers or in requiring significant modifications to the neural architecture. To this end, we propose input-driven $\mathcal{LD}$, which employs the network's input features and a lightweight layer-selecting network to determine the optimum combination of processing layers. Extensive experimentation on 4 public speech and audio benchmarks, using two different pre-trained foundation models, demonstrates the effectiveness of our approach, thoroughly outperforming random dropping and producing results on par with (or better than) early exit.
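The idea of input-driven layer dropping can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not the authors' implementation: the selector is assumed to be a single linear map that scores each layer from the input features, the budget is assumed to be a fixed number of layers to keep, and `make_layer`, `select_layers`, and `forward` are hypothetical names introduced only for illustration.

```python
def make_layer(scale):
    """Hypothetical stand-in for one backbone layer (a simple residual update)."""
    def layer(x):
        return [v + scale * v for v in x]
    return layer


def select_layers(features, weights, keep):
    """Pick which layers to run, based on the input features.

    Assumption: `weights` holds one row per layer, and a layer's score is the
    dot product of its row with the input features. The `keep` highest-scoring
    layers are retained and returned in depth order.
    """
    scores = [sum(w * f for w, f in zip(row, features)) for row in weights]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


def forward(x, layers, weights, keep):
    """Run only the selected layers; all other layers are skipped entirely."""
    active = select_layers(x, weights, keep)
    for i in active:
        x = layers[i](x)
    return x, active


# Three toy layers and an illustrative (untrained) selector weight matrix.
layers = [make_layer(0.1), make_layer(0.2), make_layer(0.3)]
weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

out_a, active_a = forward([2.0, 1.0], layers, weights, keep=2)
out_b, active_b = forward([-2.0, 1.0], layers, weights, keep=2)
print(active_a, active_b)
```

With these illustrative weights, the two inputs activate different layer subsets under the same budget of two layers, which is the property that separates input-driven selection from random dropping. Lowering `keep` reduces the number of layers executed per input.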

