CL · Nov 16, 2023

Whispers of Doubt Amidst Echoes of Triumph in NLP Robustness

arXiv:2311.09694v2 · 35 citations · h-index: 38
Originality: Synthesis-oriented
AI Analysis

This work highlights unresolved robustness problems in NLP for researchers and practitioners, and offers incremental insights into the limitations of current evaluation methods.

The paper investigates whether larger models resolve NLP robustness issues, finding that scaling does not adequately improve robustness and that current evaluation methods are problematic.

Do larger and more performant models resolve NLP's longstanding robustness issues? We investigate this question using over 20 models of different sizes spanning different architectural choices and pretraining objectives. We conduct evaluations using (a) out-of-domain and challenge test sets, (b) behavioral testing with CheckLists, (c) contrast sets, and (d) adversarial inputs. Our analysis reveals that not all out-of-domain tests provide insight into robustness. Evaluating with CheckLists and contrast sets shows significant gaps in model performance; merely scaling models does not make them adequately robust. Finally, we point out that current approaches for adversarial evaluations of models are themselves problematic: they can be easily thwarted, and in their current forms, do not represent a sufficiently deep probe of model robustness. We conclude that not only is the question of robustness in NLP as yet unresolved, but even some of the approaches to measure robustness need to be reassessed.
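As a rough illustration of evaluation styles (b) and (c) from the abstract, the sketch below computes a CheckList-style invariance failure rate and a contrast-set consistency score. It is a minimal sketch under stated assumptions, not the paper's evaluation code: `toy_sentiment_model`, the example sentences, and both metric helpers are hypothetical names introduced here, and the keyword classifier only stands in for a real model.

```python
from typing import Callable, List, Tuple

Model = Callable[[str], str]


def toy_sentiment_model(text: str) -> str:
    """Hypothetical stand-in for a real classifier: keyword-based sentiment."""
    negative_cues = {"bad", "terrible", "awful"}
    tokens = text.lower().replace(".", "").split()
    return "negative" if any(tok in negative_cues for tok in tokens) else "positive"


def invariance_failure_rate(model: Model, pairs: List[Tuple[str, str]]) -> float:
    """Fraction of (original, perturbed) pairs whose predicted label changes.

    The perturbations are meant to be label-preserving, so any change in
    prediction counts as a robustness failure.
    """
    failures = sum(1 for original, perturbed in pairs if model(original) != model(perturbed))
    return failures / len(pairs)


def contrast_consistency(model: Model, contrast_sets: List[List[Tuple[str, str]]]) -> float:
    """Fraction of contrast sets in which every example is classified correctly."""
    consistent = sum(
        1 for examples in contrast_sets
        if all(model(text) == gold for text, gold in examples)
    )
    return consistent / len(contrast_sets)


# Illustrative data only (assumed, not taken from the paper).
INVARIANCE_PAIRS = [
    ("The food in Boston was bad.", "The food in Chicago was bad."),  # location swap
    ("The service was terrible.", "The service was terible."),        # typo
    ("I liked the movie.", "I liked the film."),                      # synonym
]

CONTRAST_SETS = [
    [("The plot was bad.", "negative"), ("The plot was not bad.", "positive")],
    [("The acting was awful.", "negative"), ("The acting was great.", "positive")],
]

if __name__ == "__main__":
    print("invariance failure rate:", invariance_failure_rate(toy_sentiment_model, INVARIANCE_PAIRS))
    print("contrast consistency:", contrast_consistency(toy_sentiment_model, CONTRAST_SETS))
```

In this toy setup the classifier handles the unperturbed sentences but fails on the typo and on the negation, which is the kind of gap that aggregate accuracy on a standard test set can hide.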

Code Implementations: 1 repo