Dec 16, 2022

Multi-Scales Data Augmentation Approach In Natural Language Inference For Artifacts Mitigation And Pre-Trained Model Optimization

arXiv:2212.08756v4
1 citation
h-index: 3
Originality: Incremental advance
AI Analysis

This work addresses dataset artifacts in NLI for NLP researchers, offering incremental improvements through data augmentation techniques.

The paper tackles the problem of dataset artifacts in natural language inference (NLI) by analyzing the SNLI corpus and proposing a multi-scale data augmentation approach, which enhances model robustness and outperforms pre-trained baselines in perturbation testing.

Machine learning models can reach high performance on benchmark natural language processing (NLP) datasets yet fail in more challenging settings. We study this issue in the case where a pre-trained model learns dataset artifacts in natural language inference (NLI), the task of determining the logical relationship between a pair of text sequences. We provide a variety of techniques for analyzing and locating dataset artifacts in the crowdsourced Stanford Natural Language Inference (SNLI) corpus, and we study the stylistic patterns of these artifacts. To mitigate them, we employ a unique multi-scale data augmentation technique with two distinct frameworks: a behavioral testing checklist at the sentence level and lexical synonym criteria at the word level. Specifically, our combined method improves the model's robustness under perturbation testing, enabling it to consistently outperform the pre-trained baseline.
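
As a rough illustration of the word-level half of this idea, the sketch below swaps words in an NLI hypothesis for synonyms while keeping the label fixed. It is not the authors' implementation: the `SYNONYMS` table, the function names, and the assumption that a synonym swap preserves the label are all hypothetical choices made for the example, and the abstract does not specify which lexical resource the paper uses.

```python
import random

# Hypothetical toy synonym table (an assumption for illustration only;
# the abstract does not say which lexical resource the paper relies on).
SYNONYMS = {
    "man": ["gentleman"],
    "sleeping": ["asleep", "dozing"],
    "big": ["large"],
}


def augment_hypothesis(hypothesis, rng, table=SYNONYMS):
    """Replace every word found in the table with a randomly chosen synonym."""
    out = []
    for word in hypothesis.split():
        choices = table.get(word.lower())
        out.append(rng.choice(choices) if choices else word)
    return " ".join(out)


def augment_example(example, rng):
    """Return a perturbed copy of an NLI example.

    Assumption: a synonym swap does not change the gold label, so the
    label is copied over unchanged.
    """
    return {
        "premise": example["premise"],
        "hypothesis": augment_hypothesis(example["hypothesis"], rng),
        "label": example["label"],
    }


example = {
    "premise": "A man is awake on a bench.",
    "hypothesis": "The man is sleeping",
    "label": "contradiction",
}
augmented = augment_example(example, random.Random(0))
print(augmented["hypothesis"])
```

Perturbed copies like `augmented` could be added to the training set or held out as a perturbation test set. A sentence-level behavioral check would sit alongside this as a separate step and is not shown here.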
