CL LG Apr 29, 2020

Pre-training Is (Almost) All You Need: An Application to Commonsense Reasoning

arXiv:2004.14074v1
1026 citations
Originality: Incremental advance
AI Analysis

This paper addresses the problem of inefficient fine-tuning for NLP practitioners by offering a more stable and data-efficient approach, though the advance is incremental because it builds on existing pre-training methods.

The paper tackles the sub-optimality of fine-tuning pre-trained transformers with randomly initialized classifiers. It introduces a scoring method that reuses the pre-trained masked language modeling head for commonsense reasoning tasks, reaching 80% test accuracy on COPA without fine-tuning and, when fine-tuned, training more stably with less annotated data.

Fine-tuning of pre-trained transformer models has become the standard approach for solving common NLP tasks. Most of the existing approaches rely on a randomly initialized classifier on top of such networks. We argue that this fine-tuning procedure is sub-optimal as the pre-trained model has no prior on the specific classifier labels, while it might have already learned an intrinsic textual representation of the task. In this paper, we introduce a new scoring method that casts a plausibility ranking task in a full-text format and leverages the masked language modeling head tuned during the pre-training phase. We study commonsense reasoning tasks where the model must rank a set of hypotheses given a premise, focusing on the COPA, Swag, HellaSwag and CommonsenseQA datasets. By exploiting our scoring method without fine-tuning, we are able to produce strong baselines (e.g. 80% test accuracy on COPA) that are comparable to supervised approaches. Moreover, when fine-tuning directly on the proposed scoring function, we show that our method provides a much more stable training phase across random restarts (e.g. a $10\times$ reduction in the standard deviation of COPA test accuracy) and requires less annotated data than the standard classifier approach to reach equivalent performance.
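
The idea of ranking hypotheses with a masked language modeling head can be illustrated with a minimal sketch. It is not the authors' implementation: the names `sequence_score`, `rank_hypotheses`, `toy_log_prob` and the `RELATED` table are hypothetical, the choice to mask only premise tokens is an assumption, and a toy lookup stands in for a real pre-trained model so that the example runs without external dependencies.

```python
import math
from typing import Callable, List, Sequence, Tuple

MASK = "[MASK]"

# Signature of a hypothetical masked-LM head:
# (tokens with one position masked, masked index, original token) -> log-probability.
LogProbFn = Callable[[Sequence[str], int, str], float]


def sequence_score(tokens: Sequence[str], positions: Sequence[int], log_prob: LogProbFn) -> float:
    """Sum the log-probabilities of the tokens at `positions`, masking one at a time."""
    total = 0.0
    for i in positions:
        masked = list(tokens)
        target = masked[i]
        masked[i] = MASK
        total += log_prob(masked, i, target)
    return total


def rank_hypotheses(premise: str, hypotheses: List[str], log_prob: LogProbFn) -> Tuple[int, List[float]]:
    """Score each full text (premise + hypothesis); return the best index and all scores."""
    premise_tokens = premise.split()
    scores = []
    for hypothesis in hypotheses:
        tokens = premise_tokens + hypothesis.split()
        # Assumption: only the premise tokens are masked and scored.
        scores.append(sequence_score(tokens, range(len(premise_tokens)), log_prob))
    best = max(range(len(scores)), key=lambda k: scores[k])
    return best, scores


# Toy stand-in for a pre-trained model: a masked token is "likely"
# when a related word appears somewhere in the remaining context.
RELATED = {("wet", "rained"), ("rained", "wet")}


def toy_log_prob(masked_tokens: Sequence[str], index: int, target: str) -> float:
    related_in_context = any((target, other) in RELATED for other in masked_tokens)
    return math.log(0.5 if related_in_context else 0.1)


best, scores = rank_hypotheses(
    "the ground is wet because",
    ["it rained", "it was sunny"],
    toy_log_prob,
)
```

With the toy scorer, the first hypothesis receives the higher score because "rained" in the context makes the masked premise token "wet" more likely. In practice `toy_log_prob` would be replaced by a call to an actual masked language model, and no randomly initialized classifier layer is involved at any point.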

