Investigating Biases in Textual Entailment Datasets
This work addresses data quality issues relevant to NLP researchers, but it is incremental, as it builds on known biases in crowdsourced datasets.
The paper investigated biases in textual entailment datasets, finding that a classifier given only the hypotheses in SNLI achieved 64% accuracy, and proposed a method to reduce these biases.
Understanding logical relationships between sentences is an important part of language understanding. To support progress on this task, researchers have collected datasets for training and evaluating machine learning systems. However, as in the crowdsourced Visual Question Answering (VQA) task, some biases inevitably occur in the data. In our experiments, we find that performing classification on just the hypotheses in the SNLI dataset, without access to the premises, yields an accuracy of 64%. We analyze the extent of the bias in the SNLI and MultiNLI datasets, discuss its implications, and propose a simple method to reduce the biases in the datasets.
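To make the hypothesis-only setting concrete, the following is a minimal sketch of such a baseline, not the classifier used in the experiments. It assumes a bag-of-words multinomial naive Bayes model, and the six training triples are invented purely for illustration (they are not SNLI examples). The names `TOY_TRAIN`, `train_hypothesis_only`, `predict`, `accuracy`, and `majority_baseline` are hypothetical. The point it shows is that the premise is never read, so any accuracy above the majority-class rate must come from regularities in the hypotheses alone.

```python
import math
from collections import Counter, defaultdict

# Toy (premise, hypothesis, label) triples invented for illustration.
# They are NOT taken from SNLI or MultiNLI.
TOY_TRAIN = [
    ("A man plays guitar.", "A man is sleeping.", "contradiction"),
    ("A dog runs in a park.", "Nobody is outside.", "contradiction"),
    ("A woman reads a book.", "A person is reading.", "entailment"),
    ("Kids play soccer.", "People are playing a sport.", "entailment"),
    ("A man cooks dinner.", "A man cooks for his family.", "neutral"),
    ("A girl paints.", "A girl paints for a contest.", "neutral"),
]


def tokenize(text):
    """Lowercase, drop periods, and split on whitespace."""
    return text.lower().replace(".", "").split()


def train_hypothesis_only(examples, alpha=1.0):
    """Fit a multinomial naive Bayes model that never sees the premise."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for _premise, hypothesis, label in examples:
        label_counts[label] += 1
        for tok in tokenize(hypothesis):
            word_counts[label][tok] += 1
            vocab.add(tok)
    return {"labels": label_counts, "words": word_counts,
            "vocab": vocab, "alpha": alpha}


def predict(model, hypothesis):
    """Return the label with the highest smoothed log-posterior."""
    total = sum(model["labels"].values())
    vocab_size = len(model["vocab"])
    best_label, best_score = None, -math.inf
    for label in sorted(model["labels"]):
        score = math.log(model["labels"][label] / total)
        denom = sum(model["words"][label].values()) + model["alpha"] * vocab_size
        for tok in tokenize(hypothesis):
            if tok in model["vocab"]:  # ignore words unseen in training
                count = model["words"][label][tok]
                score += math.log((count + model["alpha"]) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy(model, examples):
    """Fraction of examples whose label is predicted from the hypothesis alone."""
    correct = sum(predict(model, h) == y for _p, h, y in examples)
    return correct / len(examples)


def majority_baseline(examples):
    """Accuracy obtained by always guessing the most frequent label."""
    counts = Counter(y for _p, _h, y in examples)
    return counts.most_common(1)[0][1] / len(examples)


model = train_hypothesis_only(TOY_TRAIN)
print("majority baseline:", majority_baseline(TOY_TRAIN))
print("hypothesis-only (on toy training data):", accuracy(model, TOY_TRAIN))
```

On a real dataset, the comparison of interest would be held-out hypothesis-only accuracy against the majority-class rate; the toy numbers printed here say nothing about SNLI itself.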