CL, Jan 25, 2024

No More Distractions: an Adaptive Up-Sampling Algorithm to Reduce Data Artifacts

arXiv:2401.13907v1

Originality: Incremental advance

AI Analysis

The paper addresses generalization issues in NLP benchmarks, which matter to both researchers and practitioners, though the advance is incremental because it focuses on a specific dataset artifact.

The paper tackled the problem of language models learning spurious correlations from data artifacts in the SNLI dataset, proposing an adaptive up-sampling algorithm that corrected these artifacts without human intervention, resulting in significantly better model performance on both overall and corrected subsets.

Researchers have recently found that language models sometimes achieve high accuracy on a benchmark data set but fail to generalize when the original data are changed even slightly. This is sometimes due to data artifacts: the model learns spurious correlations between tokens and labels instead of the semantics and logic. In this work, we analyzed the SNLI data and visualized such spurious correlations. We proposed an adaptive up-sampling algorithm to correct the data artifacts; it is simple and effective and needs no human edits or annotation. In an experiment, we applied the algorithm to fix the data artifacts in SNLI, and the model trained on the corrected data performed significantly better than the model trained on raw SNLI data, both overall and on the subset we corrected.
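The abstract does not spell out the algorithm, so the following is only a minimal sketch of one way token-level up-sampling could look, not the authors' method. It assumes whitespace tokenization, that an artifact shows up as a token co-occurring mostly with one label, and that duplicating minority-label examples containing that token is an acceptable correction. The function names `token_label_counts` and `upsample_token` are hypothetical.

```python
from collections import Counter, defaultdict


def token_label_counts(examples):
    """For each token, count how many examples of each label contain it."""
    counts = defaultdict(Counter)
    for text, label in examples:
        # Use a set so a token repeated within one example is counted once.
        for tok in set(text.lower().split()):
            counts[tok][label] += 1
    return counts


def upsample_token(examples, token):
    """Duplicate examples containing `token` until every label that
    already co-occurs with it is equally frequent.

    Assumption: labels that never co-occur with `token` are left alone,
    because there is nothing to duplicate for them.
    """
    by_label = defaultdict(list)
    for text, label in examples:
        if token in set(text.lower().split()):
            by_label[label].append((text, label))
    if not by_label:
        return list(examples)

    target = max(len(group) for group in by_label.values())
    extra = []
    for group in by_label.values():
        # Cycle through the under-represented group to fill the gap.
        for i in range(target - len(group)):
            extra.append(group[i % len(group)])
    return list(examples) + extra


# Toy hypotheses (invented for illustration, not taken from SNLI).
examples = [
    ("nobody is outside", "contradiction"),
    ("nobody is sleeping", "contradiction"),
    ("nobody is sleeping now", "contradiction"),
    ("nobody is in the room", "entailment"),
    ("a man is outside", "entailment"),
]
balanced = upsample_token(examples, "nobody")
print(token_label_counts(balanced)["nobody"])
```

In this toy data "nobody" appears with "contradiction" three times and with "entailment" once, so the sketch appends two copies of the single "entailment" example. An adaptive variant would presumably repeat this for each skewed token and recompute the counts after every pass, since up-sampling one token shifts the statistics of the others.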
