Few Dimensions are Enough: Fine-tuning BERT with Selected Dimensions Revealed Its Redundant Nature
This work addresses the problem of model efficiency and interpretability for NLP practitioners by revealing BERT's redundancy, though it is incremental as it builds on existing fine-tuning practices.
The study investigated the redundancy in BERT models during fine-tuning on GLUE tasks, finding that most tasks require only 2-3 dimensions and that outputs beyond the CLS vector contain equivalent information, with hidden layers changing significantly and the model having excessive dimensions.
When fine-tuning BERT models for specific tasks, it is common to select part of the final layer's output and input it into a newly created fully connected layer. However, it remains unclear which part of the final layer should be selected and what information each dimension of the layers holds. In this study, we comprehensively investigated the effectiveness and redundancy of token vectors, layers, and dimensions through BERT fine-tuning on GLUE tasks. The results showed that outputs other than the CLS vector in the final layer contain equivalent information, most tasks require only 2-3 dimensions, and while the contribution of lower layers decreases, there is little difference among higher layers. We also evaluated the impact of freezing pre-trained layers and conducted cross-fine-tuning, where fine-tuning is applied sequentially to different tasks. The findings suggest that hidden layers may change significantly during fine-tuning, BERT has considerable redundancy, enabling it to handle multiple tasks simultaneously, and its number of dimensions may be excessive.