Is Smaller Always Faster? Tradeoffs in Compressing Self-Supervised Speech Transformers
This work addresses the challenge of deploying large speech models in real-world scenarios by offering practical guidance on compression techniques. Its contribution is incremental: it evaluates and compares existing methods rather than introducing new ones.
The paper tackles the problem of inconsistent evaluation metrics for compressed self-supervised speech Transformers. It conducts a comprehensive study of four common compression methods and shows that each offers distinct advantages in terms of parameter count, multiply-accumulate operations, and real-time factor.
Transformer-based self-supervised models have achieved remarkable success in speech processing, but their large size and high inference cost present significant challenges for real-world deployment. While numerous compression techniques have been proposed, inconsistent evaluation metrics make it difficult to compare their practical effectiveness. In this work, we conduct a comprehensive study of four common compression methods applied to self-supervised speech Transformers: weight pruning, head pruning, low-rank approximation, and knowledge distillation. We evaluate each method under three key metrics: parameter count, multiply-accumulate operations, and real-time factor. Results show that each method offers distinct advantages. In addition, we contextualize recent compression techniques, comparing DistilHuBERT, FitHuBERT, LightHuBERT, ARMHuBERT, and STaRHuBERT under the same framework, offering practical guidance on compression for deployment.
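To make the three metrics concrete, the sketch below computes parameter count and multiply-accumulate operations (MACs) for a single linear layer, and real-time factor (RTF) as processing time divided by audio duration. It is a minimal illustration and not the paper's evaluation code. The layer sizes (768 and 3072), the rank (128), the frame rate (50 frames per second), the sparsity level, and all function names are assumptions chosen for the example. It also suggests why a smaller model need not be faster: unstructured weight pruning lowers the number of non-zero parameters, yet a dense kernel still performs the same number of MACs, whereas low-rank factorization reduces both.

```python
def linear_params(d_in, d_out, bias=True):
    """Parameters of a dense linear layer y = W x + b."""
    return d_in * d_out + (d_out if bias else 0)


def linear_macs(d_in, d_out, num_frames):
    """MACs of a dense linear layer applied to num_frames input vectors."""
    return num_frames * d_in * d_out


def low_rank_params(d_in, d_out, rank, bias=True):
    """Parameters after factorizing W (d_out x d_in) into two rank-r matrices."""
    return rank * (d_in + d_out) + (d_out if bias else 0)


def low_rank_macs(d_in, d_out, rank, num_frames):
    """MACs of the two smaller matrix products that replace W."""
    return num_frames * rank * (d_in + d_out)


def real_time_factor(processing_seconds, audio_seconds):
    """RTF below 1 means the model runs faster than real time."""
    return processing_seconds / audio_seconds


# Illustrative (assumed) sizes: one feed-forward projection, one second of audio.
d_in, d_out, rank, frames = 768, 3072, 128, 50

dense_p = linear_params(d_in, d_out)
dense_m = linear_macs(d_in, d_out, frames)
lr_p = low_rank_params(d_in, d_out, rank)
lr_m = low_rank_macs(d_in, d_out, rank, frames)

# Unstructured pruning at an assumed 80% sparsity: fewer non-zero weights,
# but a dense kernel still executes every multiply-accumulate.
sparsity = 0.8
pruned_nonzero_weights = round(d_in * d_out * (1 - sparsity))
pruned_dense_macs = dense_m  # unchanged without sparse-kernel support

print(f"dense:    {dense_p:>9,} params, {dense_m:>11,} MACs")
print(f"low-rank: {lr_p:>9,} params, {lr_m:>11,} MACs")
print(f"pruned:   {pruned_nonzero_weights:>9,} non-zero weights, "
      f"{pruned_dense_macs:>11,} MACs")
print(f"RTF for 0.5 s of compute on 10 s of audio: {real_time_factor(0.5, 10.0)}")
```

Parameter count and MACs can be read off the architecture, while RTF has to be measured on the target hardware, which is why the three metrics can rank compression methods differently.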