CL LG · Apr 25, 2020

MixText: Linguistically-Informed Interpolation of Hidden Space for Semi-Supervised Text Classification

arXiv:2004.12239v132.81085 citations · Has Code

Originality: Highly original

AI Analysis

This paper addresses text classification with limited labeled data, a problem relevant to both researchers and practitioners. It represents a strong, specific gain rather than a foundational advance.

The paper tackles semi-supervised text classification by introducing MixText, which builds on a data augmentation method called TMix. TMix interpolates text in hidden space, and MixText additionally incorporates unlabeled data. The authors report significant improvements over state-of-the-art methods, especially when supervision is limited.

This paper presents MixText, a semi-supervised learning method for text classification, which uses our newly designed data augmentation method called TMix. TMix creates a large amount of augmented training samples by interpolating text in hidden space. Moreover, we leverage recent advances in data augmentation to guess low-entropy labels for unlabeled data, hence making them as easy to use as labeled data. By mixing labeled, unlabeled and augmented data, MixText significantly outperformed current pre-trained and fine-tuned models and other state-of-the-art semi-supervised learning methods on several text classification benchmarks. The improvement is especially prominent when supervision is extremely limited. We have publicly released our code at https://github.com/GT-SALT/MixText.
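
The two ideas in the abstract, interpolating in hidden space and guessing low-entropy labels, can be illustrated with a small sketch. It is not the released implementation. The abstract does not give the formulas, so the sketch assumes mixup-style linear interpolation with a Beta-distributed weight and temperature sharpening of predicted label distributions. The function names (`sample_lambda`, `mix`, `sharpen`), the `alpha` and `temperature` values, and the toy vectors standing in for encoder hidden states are all hypothetical.

```python
import random


def sample_lambda(alpha=0.75, rng=random):
    """Draw a mixing weight from Beta(alpha, alpha).

    Assumption: taking max(lam, 1 - lam) keeps the mixed sample closer
    to the first input. The alpha value is illustrative only.
    """
    lam = rng.betavariate(alpha, alpha)
    return max(lam, 1.0 - lam)


def mix(vec_a, vec_b, lam):
    """Linearly interpolate two equal-length vectors with weight lam.

    The same function serves for hidden states and for label
    distributions, so inputs and targets are mixed consistently.
    """
    if len(vec_a) != len(vec_b):
        raise ValueError("vectors must have the same length")
    return [lam * a + (1.0 - lam) * b for a, b in zip(vec_a, vec_b)]


def sharpen(probs, temperature=0.5):
    """Lower the entropy of a probability vector by temperature scaling.

    Assumption: this stands in for the 'guess low-entropy labels' step;
    a temperature below 1 pushes mass toward the most likely class.
    """
    powered = [p ** (1.0 / temperature) for p in probs]
    total = sum(powered)
    return [p / total for p in powered]


# Toy stand-ins for hidden states of two sentences at some encoder layer.
hidden_a, label_a = [1.0, 0.0, 2.0], [1.0, 0.0]
hidden_b, label_b = [0.0, 1.0, 0.0], [0.0, 1.0]

lam = sample_lambda()
mixed_hidden = mix(hidden_a, hidden_b, lam)
mixed_label = mix(label_a, label_b, lam)

# A guessed label for an unlabeled example, sharpened before use.
guessed_label = sharpen([0.6, 0.4])
```

In a full model the interpolated hidden state would be passed through the remaining encoder layers and the classifier, and the loss would be computed against the mixed label. For the actual design, see the released code at https://github.com/GT-SALT/MixText.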
