CL, LG | Dec 5, 2018

Text Data Augmentation Made Simple By Leveraging NLP Cloud APIs

arXiv:1812.04718v1 | 129 citations

Originality: Synthesis-oriented

AI Analysis

The work addresses the 'Big Data Wall' challenge faced by minority language communities, organizations, and companies competing with large tech firms, but it is incremental: it applies existing augmentation ideas to text data.

The paper tackled the problem of insufficient text data for training deep neural networks by experimenting with various text augmentation techniques, including innovative ones using NLP Cloud APIs, and found that these techniques increased accuracy by 4.3% to 21.6% on a text polarity prediction task with an amplification factor of 5.

In practice, it is common to find oneself with far too little text data to train a deep neural network. This "Big Data Wall" is a challenge for minority language communities on the Internet and for organizations, laboratories and companies that compete with the GAFAM (Google, Amazon, Facebook, Apple, Microsoft). While most research on text data augmentation aims at the long-term goal of finding end-to-end learning solutions, which is equivalent to "using neural networks to feed neural networks", this engineering work focuses on practical, robust, scalable and easy-to-implement data augmentation pre-processing techniques similar to those that are successful in computer vision. Several text augmentation techniques were tested. Some existing ones, such as noise injection and the use of regular expressions, were tested for comparison purposes. Others, such as lexical replacement, are modified or improved versions of existing techniques. Finally, more innovative ones, such as the generation of paraphrases by back-translation or by the transformation of syntactic trees, are based on robust, scalable, and easy-to-use NLP Cloud APIs. All the text augmentation techniques studied, with an amplification factor of only 5, increased accuracy by 4.3% to 21.6%, with significant statistical fluctuations, on a standardized text polarity prediction task. Several standard deep neural network architectures were tested: the multilayer perceptron (MLP), the long short-term memory recurrent network (LSTM) and the bidirectional LSTM (biLSTM). The classical XGBoost algorithm was also tested, with improvements of up to 2.5%.
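To make the idea of pre-processing augmentation concrete, here is a minimal Python sketch of character-level noise injection, one of the baseline techniques the abstract names. It is not the authors' implementation. The function names `inject_noise` and `amplify`, the 5% edit rate, and the choice of edit operations are all assumptions made for illustration. The "amplification factor of 5" is read here as the augmented set being five times the size of the original, which is also an assumption.

```python
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def inject_noise(text, rate=0.05, rng=None):
    """Return a copy of `text` with random character-level edits.

    Each alphabetic character is, with probability `rate`, deleted,
    replaced by a random lowercase letter, or duplicated.
    (Hypothetical helper; the edit operations are illustrative.)
    """
    if rng is None:
        rng = random.Random()
    out = []
    for ch in text:
        if ch.isalpha() and rng.random() < rate:
            op = rng.choice(("delete", "substitute", "duplicate"))
            if op == "delete":
                continue
            if op == "substitute":
                out.append(rng.choice(ALPHABET))
            else:
                out.append(ch + ch)
        else:
            out.append(ch)
    return "".join(out)


def amplify(dataset, factor=5, rate=0.05, seed=0):
    """Expand a list of (text, label) pairs to `factor` times its size.

    Each original example is kept and followed by `factor - 1` noisy
    copies that carry the same label. (Assumed reading of
    "amplification factor"; the paper may define it differently.)
    """
    rng = random.Random(seed)
    augmented = []
    for text, label in dataset:
        augmented.append((text, label))
        for _ in range(factor - 1):
            augmented.append((inject_noise(text, rate, rng), label))
    return augmented
```

Calling `amplify([("this movie was great", 1)], factor=5)` returns five labeled examples: the original sentence followed by four noisy variants. The labels are copied unchanged, which relies on the assumption that light character noise does not flip the polarity of a sentence.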
