CL · SD · AS · Mar 14, 2023

I3D: Transformer architectures with input-dependent dynamic depth for speech recognition

NVIDIA
arXiv:2303.07624v1 · 45 citations · h-index: 83
Originality: Incremental advance
AI Analysis

This work addresses efficiency challenges for deploying speech recognition models in real-world applications, though it is incremental as it builds on existing Transformer and pruning techniques.

The paper tackles the large footprint and computational overhead of Transformer-based speech recognition models. It proposes I3D, a Transformer encoder with input-dependent dynamic depth, which outperforms vanilla Transformers and statically pruned models that use a similar number of layers at inference.

Transformer-based end-to-end speech recognition has achieved great success. However, the large footprint and computational overhead make it difficult to deploy these models in some real-world applications. Model compression techniques can reduce the model size and speed up inference, but the compressed model has a fixed architecture which might be suboptimal. We propose a novel Transformer encoder with Input-Dependent Dynamic Depth (I3D) to achieve strong performance-efficiency trade-offs. With a similar number of layers at inference time, I3D-based models outperform the vanilla Transformer and the static pruned model via iterative layer pruning. We also present interesting analysis on the gate probabilities and the input-dependency, which helps us better understand deep encoders.
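
To make the idea of input-dependent depth concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the gate predictor (`gate_probabilities`), the 0.5 threshold, and the toy layers `scale` and `shift` are hypothetical stand-ins chosen for illustration, since this summary does not describe the paper's actual gate design or training. The sketch only shows the general pattern: a gate computed from the input decides which layers run, and a skipped layer reduces to the identity through the residual path.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Layer = Callable[[Vector], Vector]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gate_probabilities(
    features: Vector, weights: Sequence[float], biases: Sequence[float]
) -> List[float]:
    """Hypothetical gate predictor: one probability per layer,
    computed from the mean of the input features."""
    mean = sum(features) / len(features)
    return [sigmoid(w * mean + b) for w, b in zip(weights, biases)]


def dynamic_depth_forward(
    features: Vector,
    layers: Sequence[Layer],
    weights: Sequence[float],
    biases: Sequence[float],
    threshold: float = 0.5,  # assumed decision rule, not taken from the paper
) -> Tuple[Vector, List[int]]:
    """Run only the layers whose gate probability reaches the threshold.

    Returns the final hidden vector and the indices of executed layers.
    """
    probs = gate_probabilities(features, weights, biases)
    hidden = list(features)
    executed: List[int] = []
    for index, (layer, prob) in enumerate(zip(layers, probs)):
        if prob >= threshold:
            update = layer(hidden)
            # Residual connection: the layer output is added to its input.
            hidden = [h + u for h, u in zip(hidden, update)]
            executed.append(index)
        # Otherwise the layer is skipped and acts as the identity.
    return hidden, executed


# Toy "layers" standing in for Transformer blocks (illustrative only).
def scale(hidden: Vector) -> Vector:
    return [0.5 * x for x in hidden]


def shift(hidden: Vector) -> Vector:
    return [1.0 for _ in hidden]


if __name__ == "__main__":
    layers = [scale, shift]
    weights, biases = [5.0, -5.0], [0.0, 0.0]
    for utterance in ([1.0, 3.0], [-1.0, -3.0]):
        output, executed = dynamic_depth_forward(utterance, layers, weights, biases)
        print(utterance, "-> layers run:", executed, "output:", output)
```

With these toy gate weights, the two inputs take different paths through the same two-layer stack, which is the sense in which the depth depends on the input rather than being fixed in advance as in static pruning.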

Code Implementations: 1 repo