Distill or Annotate? Cost-Efficient Fine-Tuning of Compact Models
This work addresses the cost and environmental concerns of NLP practitioners by offering a practical strategy for allocating a budget during model development. Its contribution is incremental, since it builds on existing distillation and annotation methods.
The paper tackles the problem of building compact models efficiently under a fixed budget. It compares knowledge distillation from a large model against annotating more data to train the compact model directly. Across six diverse tasks, distillation is almost always the more cost-efficient option; for example, distilling from T5-XXL to T5-Small outperforms annotation-based training of T5-Small.
Fine-tuning large models is highly effective; however, inference can be expensive and produces carbon emissions. Knowledge distillation has been shown to be a practical way to reduce inference costs, but the distillation process itself requires significant computational resources. Rather than buying or renting GPUs to fine-tune and then distill a large model, an NLP practitioner might instead allocate the available budget to hiring annotators who manually label additional fine-tuning data. In this paper, we investigate how to most efficiently use a fixed budget to build a compact model. Through extensive experiments on six diverse tasks, we show that distilling from T5-XXL (11B) to T5-Small (60M) is almost always a cost-efficient strategy compared to annotating more data to directly train a compact model (T5-Small). We further investigate how the optimal share of the budget allocated to computation varies across scenarios. We will make our code, datasets, annotation cost estimates, and baseline models available as a benchmark to support further work on cost-efficient training of compact models.
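The budget trade-off described in the abstract can be made concrete with a small arithmetic sketch. The code below is illustrative only and is not the paper's cost model: the function `split_budget`, the `Allocation` type, and every price used are hypothetical placeholders. It shows how a fixed budget, split by a chosen compute fraction, translates into rented GPU hours on one side and human-labeled examples on the other.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Allocation:
    """What a fixed budget buys under a given split (hypothetical type)."""
    gpu_hours: float
    labeled_examples: int


def split_budget(budget_cents: int, compute_fraction: float,
                 gpu_cents_per_hour: int, cents_per_label: int) -> Allocation:
    """Split a fixed budget between GPU rental and human annotation.

    All prices are assumptions supplied by the caller; amounts are in
    integer cents to avoid floating-point rounding surprises.
    """
    if not 0.0 <= compute_fraction <= 1.0:
        raise ValueError("compute_fraction must be in [0, 1]")
    if gpu_cents_per_hour <= 0 or cents_per_label <= 0:
        raise ValueError("prices must be positive")

    # Money earmarked for computation (fine-tuning the teacher, distilling).
    compute_cents = round(budget_cents * compute_fraction)
    # Whatever remains pays annotators for additional labels.
    annotation_cents = budget_cents - compute_cents

    return Allocation(
        gpu_hours=compute_cents / gpu_cents_per_hour,
        labeled_examples=annotation_cents // cents_per_label,
    )


# Placeholder prices: $1,000 budget, $2.00 per GPU hour, $0.25 per label.
half_and_half = split_budget(100_000, 0.5, 200, 25)
annotation_only = split_budget(100_000, 0.0, 200, 25)
```

Sweeping `compute_fraction` from 0 to 1 and measuring the compact model's accuracy at each resulting allocation is one way to think about the comparison the abstract describes; the actual cost estimates and experimental protocol are those of the paper, not of this sketch.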