CL AP Dec 16, 2022

Multi-Scales Data Augmentation Approach In Natural Language Inference For Artifacts Mitigation And Pre-Trained Model Optimization

arXiv:2212.08756v4 0.61 citations h-index: 3

Originality: Incremental advance

AI Analysis

This work addresses dataset artifacts in NLI and offers NLP researchers an incremental improvement through data augmentation techniques.

The paper tackles the problem of dataset artifacts in natural language inference (NLI) by analyzing the SNLI corpus and proposing a multi-scale data augmentation approach, which enhances model robustness and outperforms pre-trained baselines in perturbation testing.

Machine learning models can reach high performance on benchmark natural language processing (NLP) datasets yet fail in more challenging settings. We study this issue for the case where a pre-trained model learns dataset artifacts in natural language inference (NLI), the task of determining the logical relationship between a pair of text sequences. We provide a variety of techniques for analyzing and locating dataset artifacts in the crowdsourced Stanford Natural Language Inference (SNLI) corpus, and we study the stylistic patterns of those artifacts. To mitigate them, we employ a unique multi-scale data augmentation technique with two distinct frameworks: a behavioral testing checklist at the sentence level and lexical synonym criteria at the word level. The combined method improves our model's robustness under perturbation testing, enabling it to consistently outperform the pre-trained baseline.
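The two augmentation scales described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `NLIExample` type, the toy `SYNONYMS` and `CONTRACTIONS` tables, and the `augment` function are all hypothetical names introduced here. The word-level step swaps hypothesis words for assumed synonyms, and the sentence-level step applies a checklist-style rewrite (contraction expansion) that is assumed to preserve the label.

```python
import re
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class NLIExample:
    premise: str
    hypothesis: str
    label: str  # "entailment", "neutral", or "contradiction"


# Hypothetical toy tables; a real setup would likely use a lexical resource
# and a fuller set of behavioral-test perturbations.
SYNONYMS = {"man": "guy", "big": "large", "picture": "photo"}
CONTRACTIONS = {"isn't": "is not", "doesn't": "does not", "can't": "cannot"}


def _replace_words(text, table):
    """Replace each word found in `table`, keeping an initial capital."""
    def sub(match):
        word = match.group(0)
        new = table.get(word.lower())
        if new is None:
            return word
        return new.capitalize() if word[0].isupper() else new
    return re.sub(r"[A-Za-z']+", sub, text)


def augment(example):
    """Return label-preserving variants of `example` at two scales."""
    candidates = [
        # Word level: synonym substitution in the hypothesis only.
        replace(example,
                hypothesis=_replace_words(example.hypothesis, SYNONYMS)),
        # Sentence level: expand contractions in both sentences.
        replace(example,
                premise=_replace_words(example.premise, CONTRACTIONS),
                hypothesis=_replace_words(example.hypothesis, CONTRACTIONS)),
    ]
    # Drop variants that are identical to the original example.
    return [c for c in candidates if c != example]
```

Calling `augment` on each training pair and adding the returned variants to the training set is one way to combine both scales. Whether a given substitution truly preserves the gold label depends on context, so in practice the variants would need to be checked.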
