CL · Oct 5, 2020

Self-training Improves Pre-training for Natural Language Understanding

arXiv:2010.02194v1 · 799 citations
Originality: Incremental advance
AI Analysis

This work addresses the need for scalable semi-supervised learning in NLP by enabling self-training without in-domain unlabeled data. The advance is incremental because it builds on existing pre-training methods.

The paper tackles the problem of leveraging unlabeled data for natural language understanding by introducing SentAugment, a data augmentation method for self-training that retrieves task-specific sentences from a large web corpus and yields improvements of up to 2.6% on text classification benchmarks.

Unsupervised pre-training has led to much recent progress in natural language understanding. In this paper, we study self-training as another way to leverage unlabeled data through semi-supervised learning. To obtain additional data for a specific task, we introduce SentAugment, a data augmentation method which computes task-specific query embeddings from labeled data to retrieve sentences from a bank of billions of unlabeled sentences crawled from the web. Unlike previous semi-supervised methods, our approach does not require in-domain unlabeled data and is therefore more generally applicable. Experiments show that self-training is complementary to strong RoBERTa baselines on a variety of tasks. Our augmentation approach leads to scalable and effective self-training with improvements of up to 2.6% on standard text classification benchmarks. Finally, we also show strong gains on knowledge-distillation and few-shot learning.
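
The retrieval step described in the abstract (build a task-specific query embedding from labeled data, then pull the nearest sentences from an unlabeled bank) can be illustrated with a minimal sketch. This is not the authors' implementation: the `embed` function is a toy bag-of-words stand-in for a real sentence encoder, the bank holds four sentences rather than billions, and the names `embed`, `task_query`, and `retrieve` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def embed(sentence):
    """Toy stand-in for a sentence encoder (assumption): L2-normalised bag of words."""
    counts = Counter(sentence.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    if norm == 0:
        return {}
    return {tok: c / norm for tok, c in counts.items()}


def task_query(labeled_sentences):
    """Combine labeled-sentence embeddings into one task-level query vector.

    Summing and then normalising gives the same direction as averaging.
    """
    total = defaultdict(float)
    for sentence in labeled_sentences:
        for tok, weight in embed(sentence).items():
            total[tok] += weight
    norm = math.sqrt(sum(v * v for v in total.values()))
    return {tok: v / norm for tok, v in total.items()}


def cosine(u, v):
    """Dot product of two sparse vectors; equals cosine similarity when both are unit length."""
    return sum(weight * v.get(tok, 0.0) for tok, weight in u.items())


def retrieve(query, bank, k):
    """Return the k bank sentences most similar to the query."""
    ranked = sorted(bank, key=lambda s: cosine(embed(s), query), reverse=True)
    return ranked[:k]


# Hypothetical labeled examples for a sentiment task (labels omitted here).
labeled = ["the movie was great", "the film was terrible", "a great film overall"]

# Hypothetical unlabeled bank; only some sentences resemble the task domain.
bank = [
    "stock prices fell sharply today",
    "this movie was a great surprise",
    "the recipe needs two eggs",
    "the film was boring and terrible",
]

retrieved = retrieve(task_query(labeled), bank, k=2)
print(retrieved)
```

In a self-training setup, sentences retrieved this way would typically be pseudo-labeled by a model trained on the labeled data and then used as additional training examples. That labeling step is outside this sketch.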

Code Implementations: 1 repo