Unpacking the Resilience of SNLI Contradiction Examples to Attacks
This work examines spurious correlations in NLI models, a concern for NLP researchers, but its contribution is incremental because it builds on existing adversarial attack methods.
The study investigated how vulnerable pre-trained models on the SNLI and MultiNLI benchmarks are to adversarial attacks. The contradiction class proved more resilient, showing a smaller accuracy drop than the entailment and neutral classes, and fine-tuning on adversarial examples restored performance to near-baseline levels.
Pre-trained models excel on NLI benchmarks such as SNLI and MultiNLI, but whether they truly understand language remains uncertain. Models trained only on hypotheses and labels, without access to the premises, still achieve high accuracy, which indicates reliance on dataset biases and spurious correlations. To explore this issue, we applied the Universal Adversarial Attack to probe the vulnerabilities of a pre-trained model. Our analysis revealed substantial drops in accuracy for the entailment and neutral classes, whereas the contradiction class exhibited a smaller decline. Fine-tuning the model on a dataset augmented with adversarial examples restored its performance to near-baseline levels on both the standard and challenge sets. Our findings highlight the value of adversarial triggers for identifying spurious correlations and improving robustness, and they offer insight into the resilience of the contradiction class to adversarial attacks.
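The per-class evaluation and the augmentation step described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that the trigger is a fixed token sequence prepended to the hypothesis and that adversarial copies keep their gold labels. The names `toy_predict`, `per_class_accuracy`, `accuracy_drop`, and `augment_with_trigger` are hypothetical, and the keyword-based classifier is a stand-in for a real NLI model, chosen only because it has an obvious spurious cue.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

# (premise, hypothesis, gold label)
Example = Tuple[str, str, str]
Predictor = Callable[[str, str], str]


def toy_predict(premise: str, hypothesis: str) -> str:
    """Hypothetical stand-in for an NLI model with a spurious lexical cue."""
    tokens = hypothesis.lower().split()
    if "nobody" in tokens:  # spurious cue: a negation word forces "contradiction"
        return "contradiction"
    if set(tokens) <= set(premise.lower().split()):
        return "entailment"
    return "neutral"


def per_class_accuracy(predict: Predictor, examples: List[Example],
                       trigger: str = "") -> Dict[str, float]:
    """Accuracy per gold label, optionally with a trigger prepended to the hypothesis."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for premise, hypothesis, label in examples:
        attacked = f"{trigger} {hypothesis}".strip()  # assumption: trigger is prepended
        total[label] += 1
        if predict(premise, attacked) == label:
            correct[label] += 1
    return {label: correct[label] / total[label] for label in total}


def accuracy_drop(predict: Predictor, examples: List[Example],
                  trigger: str) -> Dict[str, float]:
    """Clean accuracy minus attacked accuracy, per class."""
    clean = per_class_accuracy(predict, examples)
    attacked = per_class_accuracy(predict, examples, trigger)
    return {label: clean[label] - attacked[label] for label in clean}


def augment_with_trigger(examples: List[Example], trigger: str) -> List[Example]:
    """Original examples plus trigger-prepended copies that keep their gold labels."""
    adversarial = [(p, f"{trigger} {h}", y) for p, h, y in examples]
    return examples + adversarial


examples = [
    ("A man plays guitar", "A man plays guitar", "entailment"),
    ("A man plays guitar", "A man is on stage", "neutral"),
    ("A man plays guitar", "nobody plays guitar", "contradiction"),
]
drops = accuracy_drop(toy_predict, examples, trigger="nobody")
augmented = augment_with_trigger(examples, trigger="nobody")
```

In this toy setting the trigger flips every entailment and neutral prediction while leaving contradiction untouched, which mirrors the direction of the reported asymmetry but says nothing about its size on SNLI or MultiNLI. With a real model, `toy_predict` would be replaced by a call to the fine-tuned classifier, and `augmented` would serve as the fine-tuning set.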