Misleading Failures of Partial-input Baselines
This work cautions against over-reliance on partial-input baselines for dataset verification, highlighting a critical limitation in NLP evaluation methods.
The paper demonstrates that a failing partial-input baseline does not guarantee an artifact-free dataset. It designs artificial datasets whose trivial patterns are visible only in the full input, and shows that a hypothesis-only model augmented with trivial premise patterns solves 15% of the SNLI examples previously considered 'hard'.
Recent work establishes dataset difficulty and removes annotation artifacts via partial-input baselines (e.g., hypothesis-only models for SNLI or question-only models for VQA). When a partial-input baseline gets high accuracy, a dataset is cheatable. However, the converse is not necessarily true: the failure of a partial-input baseline does not mean a dataset is free of artifacts. To illustrate this, we first design artificial datasets which contain trivial patterns in the full input that are undetectable by any partial-input model. Next, we identify such artifacts in the SNLI dataset: a hypothesis-only model augmented with trivial patterns in the premise can solve 15% of the examples that were previously considered "hard". Our work provides a caveat for the use of partial-input baselines for dataset verification and creation.
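The idea of a full-input pattern that no partial-input model can detect can be illustrated with a toy example. The sketch below is not the paper's actual construction; it assumes a hypothetical two-label dataset in which the premise and hypothesis are each reduced to a single binary token and the label depends only on whether the two tokens match. The helper names (`build_xor_dataset`, `best_lookup_accuracy`) are invented for illustration.

```python
from collections import Counter, defaultdict
from itertools import product


def build_xor_dataset(copies=25):
    """Toy dataset (an assumption, not the paper's data).

    Each example is (premise_token, hypothesis_token, label), where the
    label is "entailment" if the two tokens match and "contradiction"
    otherwise. Every token combination appears equally often.
    """
    data = []
    for p, h in product([0, 1], repeat=2):
        label = "entailment" if p == h else "contradiction"
        data.extend([(p, h, label)] * copies)
    return data


def best_lookup_accuracy(data, view):
    """Best accuracy of any deterministic classifier restricted to `view`.

    For each distinct value of the view, the best such classifier
    predicts the majority label, so we count those majority hits.
    """
    counts = defaultdict(Counter)
    for p, h, label in data:
        counts[view(p, h)][label] += 1
    correct = sum(c.most_common(1)[0][1] for c in counts.values())
    return correct / len(data)


data = build_xor_dataset()
hypothesis_only = best_lookup_accuracy(data, lambda p, h: h)
premise_only = best_lookup_accuracy(data, lambda p, h: p)
full_input = best_lookup_accuracy(data, lambda p, h: (p, h))

print(f"hypothesis-only: {hypothesis_only:.2f}")
print(f"premise-only:    {premise_only:.2f}")
print(f"full input:      {full_input:.2f}")
```

In this toy setting both partial-input views sit at chance while the full-input view is trivially solvable, so a failing hypothesis-only baseline would wrongly suggest the dataset is hard.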