Data Augmentation via Dependency Tree Morphing for Low-Resource Languages
This work addresses data scarcity in NLP applications for low-resource languages, though it is an incremental adaptation of image processing techniques to text.
The paper tackles the problem of poor neural NLP system performance for low-resource languages by proposing two dependency tree-based text augmentation techniques (crop and rotate), which improve part-of-speech tagging accuracy for most languages in the Universal Dependencies project, especially those with rich case marking systems.
Neural NLP systems achieve high scores when sizable training datasets are available. The lack of such datasets leads to poor system performance in the case of low-resource languages. We present two simple text augmentation techniques using dependency trees, inspired by image processing. We crop sentences by removing dependency links, and we rotate sentences by moving the tree fragments around the root. We apply these techniques to augment the training sets of low-resource languages in the Universal Dependencies project. We implement a character-level sequence tagging model and evaluate the augmented datasets on the part-of-speech tagging task. We show that crop and rotate provide improvements over models trained with non-augmented data for the majority of the languages, especially for languages with rich case marking systems.
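To make the two operations concrete, the following is a minimal Python sketch of how crop and rotate might look on a single dependency-parsed sentence. It is an illustration under stated assumptions, not the authors' implementation: the `Token` class, the helper names, the choice to crop around `nsubj` and `obj` dependents of the root, and the use of cyclic shifts for rotation are all hypothetical. The abstract states only that cropping removes dependency links and that rotation moves tree fragments around the root.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    idx: int      # 1-based position in the sentence
    form: str     # surface word
    upos: str     # part-of-speech tag, carried along unchanged
    head: int     # idx of the head token; 0 marks the root
    deprel: str   # dependency relation to the head


def root_of(tokens):
    """Return the token whose head is 0."""
    return next(t for t in tokens if t.head == 0)


def subtree(tokens, idx):
    """Return the sorted indices of token idx and all of its descendants."""
    keep = {idx}
    changed = True
    while changed:
        changed = False
        for t in tokens:
            if t.head in keep and t.idx not in keep:
                keep.add(t.idx)
                changed = True
    return sorted(keep)


def crop(tokens, focus=("nsubj", "obj")):
    """Assumed crop: for each selected dependent of the root, keep only the
    root and that dependent's subtree, dropping every other link."""
    root = root_of(tokens)
    results = []
    for t in tokens:
        if t.head == root.idx and t.deprel in focus:
            keep = set(subtree(tokens, t.idx)) | {root.idx}
            results.append([tok for tok in tokens if tok.idx in keep])
    return results


def rotate(tokens):
    """Assumed rotate: treat the root and each subtree hanging off it as a
    unit, then emit every cyclic shift of the units except the original."""
    root = root_of(tokens)
    units = [[root.idx]]
    units += [subtree(tokens, t.idx) for t in tokens if t.head == root.idx]
    units.sort(key=min)  # restore the original left-to-right unit order
    by_idx = {t.idx: t for t in tokens}
    return [
        [by_idx[i] for unit in units[k:] + units[:k] for i in unit]
        for k in range(1, len(units))
    ]


def forms(tokens):
    """Helper for display: the surface words of a token list."""
    return [t.form for t in tokens]


# Toy sentence: "She wrote a letter"
sentence = [
    Token(1, "She", "PRON", 2, "nsubj"),
    Token(2, "wrote", "VERB", 0, "root"),
    Token(3, "a", "DET", 4, "det"),
    Token(4, "letter", "NOUN", 2, "obj"),
]

for s in crop(sentence) + rotate(sentence):
    print(" ".join(forms(s)))
```

Because each token keeps its own tag when it is dropped or moved, the augmented sentences remain valid training examples for part-of-speech tagging. Free reordering of this kind is more plausible for languages that mark grammatical roles with case than for languages with fixed word order, which is consistent with the reported gains on languages with rich case marking.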