CL | Nov 10, 2022

BERT on a Data Diet: Finding Important Examples by Gradient-Based Pruning

arXiv:2211.05610v2 | 22 citations | h-index: 37
Originality: Synthesis-oriented
AI Analysis

This work addresses data inefficiency in training NLP models, but its contribution is incremental: it adapts existing data-pruning methods from computer vision to NLP.

The paper tackles the inefficient training of pre-trained language models by applying two gradient-based pruning metrics, GraNd and EL2N, to NLP for the first time. It shows that pruning a small portion of the highest-scoring examples can preserve, or even improve on, the test accuracy obtained with the full training set.

Current pre-trained language models rely on large datasets for achieving state-of-the-art performance. However, past research has shown that not all examples in a dataset are equally important during training. In fact, it is sometimes possible to prune a considerable fraction of the training set while maintaining the test performance. Established on standard vision benchmarks, two gradient-based scoring metrics for finding important examples are GraNd and its estimated version, EL2N. In this work, we employ these two metrics for the first time in NLP. We demonstrate that these metrics need to be computed after at least one epoch of fine-tuning and they are not reliable in early steps. Furthermore, we show that by pruning a small portion of the examples with the highest GraNd/EL2N scores, we can not only preserve the test accuracy, but also surpass it. This paper details adjustments and implementation choices which enable GraNd and EL2N to be applied to NLP.
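As a rough illustration of the scoring-and-pruning idea in the abstract, the sketch below computes an EL2N-style score, taken here as the L2 norm of the difference between the softmax output and the one-hot label, and then drops the highest-scoring fraction of examples. This is not the paper's implementation: the helper names (`softmax`, `el2n_score`, `prune_highest`), the toy logits, and the 25% pruning fraction are all assumptions made for illustration, and the sketch uses a single set of logits rather than averaging scores over several runs.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def el2n_score(logits, label):
    """EL2N-style score: L2 norm of (softmax(logits) - one_hot(label))."""
    probs = softmax(logits)
    return math.sqrt(
        sum((p - (1.0 if k == label else 0.0)) ** 2 for k, p in enumerate(probs))
    )

def prune_highest(scores, fraction):
    """Return indices kept after dropping the top `fraction` of examples by score."""
    n_drop = int(len(scores) * fraction)
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    dropped = set(order[:n_drop])
    return [i for i in range(len(scores)) if i not in dropped]

# Hypothetical logits, assumed to come from a model that has already been
# fine-tuned for at least one epoch (the abstract reports that scores
# computed in early steps are not reliable).
logits = [[4.0, 0.0], [0.0, 4.0], [0.5, 0.0], [3.0, 0.0]]
labels = [0, 1, 0, 1]

scores = [el2n_score(z, y) for z, y in zip(logits, labels)]
kept = prune_highest(scores, fraction=0.25)
print(kept)
```

In this toy data the last example is confidently misclassified, so it receives the highest score and is the one removed. GraNd would instead require per-example gradient norms with respect to the model parameters, which is considerably more expensive to compute.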

