Lessons Learned from Applying off-the-shelf BERT: There is no Silver Bullet
This work addresses the challenge of training large NLP models, particularly when GPU hardware is limited, by evaluating the practical effectiveness of pre-trained BERT. It is incremental: it compares existing methods without introducing new techniques.
The study investigates off-the-shelf BERT models for classification tasks and finds that their complexity and computational cost do not necessarily yield better performance than simpler methods such as LSTMs and simpler baselines.
One of the challenges in the NLP field is training large classification models, a task that is both difficult and tedious, and even harder when GPU hardware is unavailable. The increased availability of pre-trained, off-the-shelf word embeddings, models, and modules aims to ease the process of training large models and achieving competitive performance. We explore the use of off-the-shelf BERT models, share the results of our experiments, and compare them to those of LSTM networks and simpler baselines. We show that the complexity and computational cost of BERT do not guarantee better predictive performance in the classification tasks at hand.