e-SNLI-VE: Corrected Visual-Textual Entailment with Natural Language Explanations
This work addresses data quality issues in SNLI-VE, a multimodal reasoning dataset. The contribution is incremental in that it builds on existing resources.
The authors address label errors in the SNLI-VE dataset for visual-textual entailment by correcting the class with the highest error rate, yielding SNLI-VE-2.0, and then appending human-written explanations to produce e-SNLI-VE. They re-evaluate an existing model on the corrected corpus and compare its performance quantitatively with its performance on the uncorrected corpus.
The recently proposed SNLI-VE corpus for recognising visual-textual entailment is a large, real-world dataset for fine-grained multimodal reasoning. However, the automatic way in which SNLI-VE has been assembled (via combining parts of two related datasets) gives rise to a large number of errors in the labels of this corpus. In this paper, we first present a data collection effort to correct the class with the highest error rate in SNLI-VE. Secondly, we re-evaluate an existing model on the corrected corpus, which we call SNLI-VE-2.0, and provide a quantitative comparison with its performance on the non-corrected corpus. Thirdly, we introduce e-SNLI-VE, which appends human-written natural language explanations to SNLI-VE-2.0. Finally, we train models that learn from these explanations at training time, and output such explanations at testing time.
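The first step described above, finding the class with the highest label error rate, can be illustrated with a minimal sketch. It assumes a re-annotated sample in which each instance carries both its original label and a corrected one. The `Example` record, its field names, the helper functions, and the toy data are all hypothetical and are not taken from the released corpus or the paper's figures.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Example:
    """Hypothetical record layout; field names are assumptions."""
    image_id: str
    hypothesis: str
    original_label: str
    corrected_label: Optional[str] = None  # None if not re-annotated
    explanation: Optional[str] = None      # free-text explanation, if any


def error_rate_per_class(examples):
    """Fraction of re-annotated instances whose label changed, per original class."""
    totals = Counter()
    errors = Counter()
    for ex in examples:
        if ex.corrected_label is None:
            continue  # skip instances that were never re-annotated
        totals[ex.original_label] += 1
        if ex.corrected_label != ex.original_label:
            errors[ex.original_label] += 1
    return {label: errors[label] / totals[label] for label in totals}


def noisiest_class(examples):
    """Original class with the highest estimated error rate."""
    rates = error_rate_per_class(examples)
    return max(rates, key=rates.get)


# Invented toy sample, for illustration only.
sample = [
    Example("img1", "A man is outside.", "entailment", "entailment"),
    Example("img2", "A dog runs.", "entailment", "entailment"),
    Example("img3", "A woman is happy.", "neutral", "entailment"),
    Example("img4", "A child waits for a bus.", "neutral", "neutral"),
    Example("img5", "People are asleep.", "neutral", "contradiction"),
    Example("img6", "Nobody is present.", "contradiction", "contradiction"),
    Example("img7", "A cat sits indoors.", "contradiction"),
]

print(error_rate_per_class(sample))
print(noisiest_class(sample))
```

In this sketch, instances without a corrected label are excluded from the estimate, so the rates describe only the re-annotated subset.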