CL, Jun 29, 2023

Probabilistic Linguistic Knowledge and Token-level Text Augmentation

arXiv:2306.16644v2 | 0.51 citations | h-index: 4

Originality: Synthesis-oriented

AI Analysis

This work challenges the benefits that NLP practitioners commonly assume for token-level text augmentation, and contributes incremental negative results.

The paper investigated the effectiveness of token-level text augmentation techniques and probabilistic linguistic knowledge in a linguistically-motivated evaluation, finding that five common techniques (e.g., synonym replacement) were generally ineffective across models and languages, with minimal impact from linguistic knowledge.

This paper investigates the effectiveness of token-level text augmentation and the role of probabilistic linguistic knowledge within a linguistically-motivated evaluation context. Two text augmentation programs, REDA and REDA$_{NG}$, were developed, both implementing five token-level text editing operations: Synonym Replacement (SR), Random Swap (RS), Random Insertion (RI), Random Deletion (RD), and Random Mix (RM). REDA$_{NG}$ leverages pretrained $n$-gram language models to select the most likely augmented texts from REDA's output. Comprehensive and fine-grained experiments were conducted on a binary question matching classification task in both Chinese and English. The results strongly refute the general effectiveness of the five token-level text augmentation techniques under investigation, whether applied together or separately, and irrespective of various common classification model types used, including transformers. Furthermore, the role of probabilistic linguistic knowledge is found to be minimal.
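
To make the pipeline described above more concrete, the following is a minimal Python sketch of the five token-level editing operations and of an n-gram-based selection step. It is not the authors' REDA or REDA$_{NG}$ code. The synonym table, the function names, the bigram model with add-one smoothing, and the reading of Random Mix as "apply two of the other four operations" are all assumptions made for illustration; the abstract does not specify these details.

```python
import math
import random
from collections import Counter

# Toy synonym table: a hypothetical stand-in for a real synonym dictionary.
SYNONYMS = {"quick": ["fast", "rapid"], "buy": ["purchase"], "phone": ["mobile"]}

def synonym_replacement(tokens, rng):
    """SR: replace one token that has a synonym entry, if any."""
    out = list(tokens)
    candidates = [i for i, t in enumerate(out) if t in SYNONYMS]
    if candidates:
        i = rng.choice(candidates)
        out[i] = rng.choice(SYNONYMS[out[i]])
    return out

def random_swap(tokens, rng):
    """RS: swap two randomly chosen positions."""
    out = list(tokens)
    if len(out) >= 2:
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out

def random_insertion(tokens, rng):
    """RI: insert a synonym of one token at a random position."""
    out = list(tokens)
    candidates = [t for t in out if t in SYNONYMS]
    if candidates:
        word = rng.choice(SYNONYMS[rng.choice(candidates)])
        out.insert(rng.randrange(len(out) + 1), word)
    return out

def random_deletion(tokens, rng):
    """RD: delete one random token, never emptying the text."""
    out = list(tokens)
    if len(out) >= 2:
        del out[rng.randrange(len(out))]
    return out

OPERATIONS = [synonym_replacement, random_swap, random_insertion, random_deletion]

def random_mix(tokens, rng):
    """RM (assumed here): apply two distinct operations in sequence."""
    out = list(tokens)
    for op in rng.sample(OPERATIONS, 2):
        out = op(out, rng)
    return out

def train_bigram(corpus):
    """Count unigrams and bigrams over sentences padded with <s> and </s>."""
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        padded = ["<s>"] + sent + ["</s>"]
        unigrams.update(padded[:-1])
        bigrams.update(zip(padded, padded[1:]))
    vocab = {t for sent in corpus for t in sent} | {"<s>", "</s>"}
    return unigrams, bigrams, len(vocab)

def log_prob(tokens, model):
    """Log-probability of a token list under add-one-smoothed bigrams."""
    unigrams, bigrams, v = model
    padded = ["<s>"] + tokens + ["</s>"]
    return sum(
        math.log((bigrams[(a, b)] + 1) / (unigrams[a] + v))
        for a, b in zip(padded, padded[1:])
    )

def select_most_likely(candidates, model, k=1):
    """Keep the k candidates the language model scores highest."""
    return sorted(candidates, key=lambda c: log_prob(c, model), reverse=True)[:k]

if __name__ == "__main__":
    rng = random.Random(0)
    corpus = [["how", "do", "i", "buy", "a", "phone"],
              ["where", "can", "i", "buy", "a", "phone"]]
    model = train_bigram(corpus)
    augmented = [random_mix(corpus[0], rng) for _ in range(5)]
    print(select_most_likely(augmented, model, k=2))
```

In this sketch, the editing functions play the role of the random augmentation step, and `select_most_likely` plays the role of the probabilistic filter: it ranks the randomly generated candidates by language-model score and keeps the top ones. A real setup would presumably use a pretrained n-gram model and a proper synonym resource rather than these toy stand-ins.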
