AI · Oct 12, 2023

Effects of Human Adversarial and Affable Samples on BERT Generalization

Aparna Elangovan, Jiayuan He, Yuan Li, Karin Verspoor

arXiv:2310.08008v437.5132 citations · h-index: 21

Originality: Incremental advance

AI Analysis

This paper addresses poor generalization in NLP models, a practical concern for practitioners, but the advance is incremental because it builds on existing data quality research.

The paper tackles the problem of BERT-based models underperforming in real-world generalization by investigating the impact of training data quality, specifically the proportion of human-adversarial and human-affable samples. It finds that including 10-30% human-adversarial instances improves precision and F1 by up to 20 points in text classification and relation extraction, while human-affable samples may degrade generalization.

BERT-based models have had strong performance on leaderboards, yet have been demonstrably worse in real-world settings requiring generalization. The limited quantity of training data is considered a key impediment to achieving generalizability in machine learning. In this paper, we examine the impact of training data quality, not quantity, on a model's generalizability. We consider two characteristics of training data: the portion of human-adversarial (h-adversarial) training samples, i.e., sample pairs with seemingly minor differences but different ground-truth labels, and the portion of human-affable (h-affable) training samples, i.e., sample pairs with minor differences but the same ground-truth label. We find that for a fixed size of training samples, as a rule of thumb, having 10-30% h-adversarial instances improves the precision, and therefore F1, by up to 20 points in the tasks of text classification and relation extraction. Increasing h-adversarials beyond this range can result in performance plateaus or even degradation. In contrast, h-affables may not contribute to a model's generalizability and may even degrade generalization performance.
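
The two pair types defined in the abstract can be illustrated with a small sketch. The abstract does not say how "minor differences" are measured, so the character-level similarity from Python's `difflib` and the 0.8 cutoff are assumptions made only for illustration, and the helper names `classify_pairs` and `adversarial_fraction` are hypothetical rather than taken from the paper.

```python
from difflib import SequenceMatcher
from itertools import combinations


def classify_pairs(samples, threshold=0.8):
    """Split near-duplicate sample pairs into h-adversarial and h-affable.

    samples: list of (text, label) tuples.
    threshold: assumed similarity cutoff standing in for "minor differences";
               the paper's actual criterion is not given in the abstract.
    Returns two lists of index pairs: (h_adversarial, h_affable).
    """
    adversarial, affable = [], []
    for (i, (text_a, label_a)), (j, (text_b, label_b)) in combinations(
        enumerate(samples), 2
    ):
        similarity = SequenceMatcher(None, text_a, text_b).ratio()
        if similarity < threshold:
            continue  # texts differ too much to count as a near-duplicate pair
        if label_a != label_b:
            adversarial.append((i, j))  # similar text, different label
        else:
            affable.append((i, j))  # similar text, same label
    return adversarial, affable


def adversarial_fraction(samples, threshold=0.8):
    """Fraction of samples that take part in at least one h-adversarial pair."""
    adversarial, _ = classify_pairs(samples, threshold)
    involved = {index for pair in adversarial for index in pair}
    return len(involved) / len(samples) if samples else 0.0


if __name__ == "__main__":
    # Toy data invented for this sketch, not drawn from the paper.
    samples = [
        ("The drug inhibits the protein.", "positive"),
        ("The drug inhibits the proteins.", "positive"),
        ("The drug inhibits no protein.", "negative"),
        ("Rainfall was heavy.", "negative"),
    ]
    print(classify_pairs(samples))
    print(adversarial_fraction(samples))
```

A fraction computed this way could be compared against the 10-30% rule of thumb reported above, with the caveat that the result depends heavily on the chosen similarity measure and threshold.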
