Entropy Meets Importance: A Unified Head Importance-Entropy Score for Stable and Efficient Transformer Pruning
This work addresses efficiency challenges in deploying transformer models for NLP tasks by offering a more stable and accurate pruning method. The contribution is incremental, since it builds on existing gradient-based approaches.
The paper tackles the problem of pruning transformer models efficiently without sacrificing accuracy or stability. It introduces a new pruning criterion, HIES, which integrates head importance scores with attention entropy. The reported result is up to a 15.2% improvement in model quality and a 2.04x improvement in stability over HIS-only methods.
Transformer-based models have achieved remarkable performance in NLP tasks. However, their structural characteristics (multiple layers and attention heads) introduce efficiency challenges in inference and deployment. To address these challenges, various pruning methods have recently been proposed. Notably, gradient-based methods using Head Importance Scores (HIS) have gained traction for their interpretability, efficiency, and ability to identify redundant heads. However, HIS alone is limited: it captures only the gradient-driven contribution of a head and overlooks the diversity of attention patterns. To overcome this limitation, we introduce a novel pruning criterion, HIES (Head Importance-Entropy Score), which integrates head importance scores with attention entropy, providing complementary evidence on per-head contribution. Empirically, HIES-based pruning yields up to a 15.2% improvement in model quality and a 2.04x improvement in stability over HIS-only methods, enabling substantial model compression without sacrificing either accuracy or stability. Code will be released upon publication.
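
To make the idea of combining the two signals concrete, the following is a minimal sketch and not the paper's formulation. The abstract does not give the HIES formula, so the convex combination, the min-max normalization, the mixing weight `alpha`, and the choice to treat higher entropy as evidence in a head's favor are all assumptions made for illustration. The function names are hypothetical, and the per-head HIS values are taken as given inputs.

```python
import numpy as np


def attention_entropy(attn, eps=1e-12):
    """Mean Shannon entropy of each head's attention rows.

    attn: array of shape (heads, queries, keys); each row sums to 1.
    Returns an array of shape (heads,).
    """
    row_entropy = -np.sum(attn * np.log(attn + eps), axis=-1)
    return row_entropy.mean(axis=-1)


def min_max(x, eps=1e-12):
    """Rescale a vector to roughly [0, 1]; constant vectors map to zeros."""
    span = x.max() - x.min()
    return (x - x.min()) / (span + eps)


def combined_score(his, entropy, alpha=0.5):
    """Assumed combination: a convex mix of normalized HIS and entropy.

    alpha is a hypothetical mixing weight, not a parameter from the paper.
    """
    return alpha * min_max(his) + (1.0 - alpha) * min_max(entropy)


def heads_to_prune(score, ratio):
    """Indices of the lowest-scoring heads, pruning floor(ratio * n) of them."""
    k = int(len(score) * ratio)
    return np.argsort(score)[:k]
```

As a usage example, one could compute `attention_entropy` on attention maps collected from a validation batch, pass it with precomputed HIS values to `combined_score`, and mask the heads returned by `heads_to_prune`. Setting `alpha=1.0` in this sketch reduces the criterion to HIS-only ranking, which makes the two easy to compare.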