LG AI CL ML
May 14, 2019

Misleading Failures of Partial-input Baselines

Shi Feng, Eric Wallace, Jordan Boyd-Graber

arXiv:1905.05778v351.61123 citations

Originality: Incremental advance

AI Analysis

This work cautions against over-reliance on partial-input baselines for dataset verification, highlighting a critical limitation in NLP evaluation methods.

The paper demonstrates that the failure of a partial-input baseline does not guarantee that a dataset is artifact-free. It designs artificial datasets whose patterns exist only in the full input, and shows that a hypothesis-only model augmented with trivial premise patterns solves 15% of the SNLI examples previously considered 'hard'.

Recent work establishes dataset difficulty and removes annotation artifacts via partial-input baselines (e.g., hypothesis-only models for SNLI or question-only models for VQA). When a partial-input baseline gets high accuracy, a dataset is cheatable. However, the converse is not necessarily true: the failure of a partial-input baseline does not mean a dataset is free of artifacts. To illustrate this, we first design artificial datasets which contain trivial patterns in the full input that are undetectable by any partial-input model. Next, we identify such artifacts in the SNLI dataset: a hypothesis-only model augmented with trivial patterns in the premise can solve 15% of the examples that were previously considered "hard". Our work provides a caveat for the use of partial-input baselines for dataset verification and creation.
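
The claim that a full-input pattern can be invisible to every partial-input model can be illustrated with a toy example. The sketch below is not the paper's construction; it is a minimal XOR-style dataset, and the names `make_xor_dataset` and `best_lookup_accuracy` are hypothetical. Each example has one premise bit and one hypothesis bit, and the label is their XOR, so the label is trivially recoverable from the full input while either half alone carries no information about it.

```python
import itertools
from collections import Counter, defaultdict


def make_xor_dataset():
    """Toy dataset (an assumption, not the paper's data): label = premise_bit XOR hypothesis_bit."""
    return [((p, h), p ^ h) for p, h in itertools.product([0, 1], repeat=2)]


def best_lookup_accuracy(dataset, view):
    """Accuracy of the best classifier that sees only view(x).

    For each distinct value of the view, the classifier predicts the
    majority label among the examples that share that value.
    """
    labels_by_key = defaultdict(Counter)
    for x, y in dataset:
        labels_by_key[view(x)][y] += 1
    correct = sum(max(counts.values()) for counts in labels_by_key.values())
    return correct / len(dataset)


data = make_xor_dataset()
premise_only = best_lookup_accuracy(data, lambda x: x[0])     # sees the premise bit only
hypothesis_only = best_lookup_accuracy(data, lambda x: x[1])  # sees the hypothesis bit only
full_input = best_lookup_accuracy(data, lambda x: x)          # sees both bits

print(premise_only, hypothesis_only, full_input)
```

In this toy setting both partial-input baselines sit at chance while the full-input lookup is perfect, which is the situation the abstract warns about: a failing hypothesis-only baseline says nothing about patterns that only appear when premise and hypothesis are combined.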
