CL · Sep 11, 2024

Enhancing adversarial robustness in Natural Language Inference using explanations

arXiv:2409.07423v2 · 25 citations · h-index: 29
AI Analysis

This work addresses the susceptibility of NLI models to adversarial attacks and offers a resource-efficient way to improve robustness in NLP applications.

The paper tackles adversarial robustness in Natural Language Inference (NLI) by using natural language explanations as a model-agnostic defense strategy. A classifier fine-tuned on explanations, rather than on premise-hypothesis inputs, proves more robust under adversarial attacks than explanation-free baselines.
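To make the difference between the two input formats concrete, here is a minimal sketch of the data flow only, not the authors' implementation. The names `NLIExample`, `baseline_input`, `explanation_input`, `make_training_pairs`, and `predict` are hypothetical, as is the `[SEP]` separator. The assumption that an explanation is generated from the premise-hypothesis pair at inference time before classification is also ours; the summary states only that the classifier is fine-tuned on explanations.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class NLIExample:
    premise: str
    hypothesis: str
    explanation: str  # natural language explanation for the label
    label: str        # e.g. "entailment", "neutral", "contradiction"


def baseline_input(ex: NLIExample) -> str:
    # Explanation-free baseline: the classifier reads premise and hypothesis.
    # The "[SEP]" separator is an illustrative assumption.
    return f"{ex.premise} [SEP] {ex.hypothesis}"


def explanation_input(ex: NLIExample) -> str:
    # Explanation-based variant: the classifier reads only the explanation.
    return ex.explanation


def make_training_pairs(
    examples: List[NLIExample],
    build_input: Callable[[NLIExample], str],
) -> List[Tuple[str, str]]:
    # Produce (text, label) pairs for fine-tuning any text classifier.
    return [(build_input(ex), ex.label) for ex in examples]


def predict(
    premise: str,
    hypothesis: str,
    explainer: Callable[[str, str], str],
    classifier: Callable[[str], str],
) -> str:
    # Assumed two-stage inference: generate an explanation, then classify it.
    # Both callables are placeholders for real models.
    explanation = explainer(premise, hypothesis)
    return classifier(explanation)
```

Because `make_training_pairs` takes the input builder as an argument, the same fine-tuning code can produce both the baseline and the explanation-based training sets, which keeps the comparison between them controlled.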

The surge of state-of-the-art Transformer-based models has undoubtedly pushed the limits of NLP model performance, excelling in a variety of tasks. We cast the spotlight on the underexplored task of Natural Language Inference (NLI), since models trained on popular well-suited datasets are susceptible to adversarial attacks, allowing subtle input interventions to mislead the model. In this work, we validate the usage of natural language explanation as a model-agnostic defence strategy through extensive experimentation: only by fine-tuning a classifier on the explanation rather than premise-hypothesis inputs, robustness under various adversarial attacks is achieved in comparison to explanation-free baselines. Moreover, since there is no standard strategy of testing the semantic validity of the generated explanations, we research the correlation of widely used language generation metrics with human perception, in order for them to serve as a proxy towards robust NLI models. Our approach is resource-efficient and reproducible without significant computational limitations.
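One way to check whether an automatic generation metric tracks human judgments of explanation quality is a rank correlation between the two sets of scores. The sketch below computes Spearman's rank correlation in plain Python. The choice of Spearman is an assumption, since the abstract does not say which correlation measure is used, and the function names are hypothetical.

```python
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    # Assign 1-based ranks; tied values share the mean of their positions.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    # Pearson correlation coefficient of two equal-length sequences.
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if std_x == 0 or std_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (std_x * std_y)


def spearman(metric_scores: Sequence[float], human_scores: Sequence[float]) -> float:
    # Spearman correlation is the Pearson correlation of the rank vectors.
    if len(metric_scores) != len(human_scores) or len(metric_scores) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    return pearson(average_ranks(metric_scores), average_ranks(human_scores))
```

Here `metric_scores` would hold one automatic score per generated explanation and `human_scores` the matching human ratings. A value near 1 suggests the metric ranks explanations much as annotators do, while a value near 0 suggests it is a poor proxy.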

Code Implementations: 1 repo
