Pruning by Block Benefit: Exploring the Properties of Vision Transformer Blocks during Domain Adaptation
This work addresses the computational inefficiency of Vision Transformers for resource-limited hardware in transfer learning scenarios, offering an incremental improvement over existing pruning methods.
The paper tackles the problem of pruning Vision Transformers for domain adaptation, where traditional pruning methods misevaluate weight significance on unseen data, leading to suboptimal resource assignment. The proposed Pruning by Block Benefit (P3B) method achieves state-of-the-art results, preserving high performance even at 70% parameter reduction while losing only 0.64% in accuracy.
Vision Transformers have set new benchmarks in several tasks, but these models come with high computational costs, which makes them impractical for resource-limited hardware. Network pruning reduces the computational complexity by removing less important operations while maintaining performance. However, pruning a model on an unseen data domain leads to a misevaluation of weight significance, resulting in suboptimal resource assignment. In this work, we find that task-sensitive layers initially fail to improve the feature representation on downstream tasks, leading to performance loss for early pruning decisions. To address this problem, we introduce Pruning by Block Benefit (P3B), a pruning method that uses the relative contribution at the block level to assign parameter resources globally. P3B identifies low-impact components to reduce parameter allocation while preserving critical ones. Classical pruning mask optimization struggles to reactivate mask elements that have been set to zero. In contrast, P3B sets a layerwise keep ratio based on global performance metrics, ensuring the reactivation of late-converging blocks. We show in extensive experiments that P3B is a state-of-the-art pruning method, with the most noticeable gains in transfer learning tasks. Notably, P3B is able to conserve high performance even in high-sparsity regimes of 70% parameter reduction while losing only 0.64% in accuracy.
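The idea of deriving a layerwise keep ratio from block-level contributions can be illustrated with a minimal sketch. The abstract does not state how block benefit is measured or how the budget is distributed, so everything below is an assumption made for illustration: the function name `assign_keep_ratios`, the proportional allocation rule, and the `floor` parameter are hypothetical and are not taken from the paper. The sketch assumes each block already has a non-negative benefit score and splits a global parameter budget in proportion to those scores.

```python
def assign_keep_ratios(benefits, block_params, global_keep, floor=0.05):
    """Hypothetical allocation rule (not the paper's formula).

    benefits     : per-block relative contribution scores (assumed given)
    block_params : number of prunable parameters in each block
    global_keep  : fraction of all parameters to keep, in (0, 1]
    floor        : minimum keep ratio, so no block is removed entirely
    """
    if not 0.0 < global_keep <= 1.0:
        raise ValueError("global_keep must be in (0, 1]")

    # Treat negative benefit as zero contribution.
    scores = [max(b, 0.0) for b in benefits]
    if sum(scores) == 0.0:
        # No information about block importance: prune uniformly.
        return [global_keep] * len(scores)

    budget = global_keep * sum(block_params)

    def clipped_ratio(scale, score):
        # Keep ratio proportional to the score, clipped to [floor, 1].
        return min(1.0, max(floor, scale * score))

    def kept_params(scale):
        return sum(p * clipped_ratio(scale, s)
                   for p, s in zip(block_params, scores))

    # kept_params is non-decreasing in scale, so bisection finds the
    # scale that meets the budget (or the closest feasible value when
    # the floor or the cap of 1.0 makes the budget unreachable).
    lo, hi = 0.0, 1.0 / min(s for s in scores if s > 0.0)
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if kept_params(mid) < budget:
            lo = mid
        else:
            hi = mid
    return [clipped_ratio(hi, s) for s in scores]


# Early in fine-tuning, the last block contributes little on the new domain.
early = assign_keep_ratios([0.4, 0.3, 0.2, 0.1], [100] * 4, global_keep=0.3)
# Later its benefit has grown; recomputing the ratios raises its share.
late = assign_keep_ratios([0.2, 0.2, 0.2, 0.4], [100] * 4, global_keep=0.3)
print(early[-1] < late[-1])
```

Because the ratios are recomputed from the current scores rather than read from a fixed mask, a block whose benefit rises late in training regains parameters in this sketch, which is the reactivation behavior the abstract contrasts with classical mask optimization.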