Visual Entailment Task for Visually-Grounded Language Learning
This work addresses visually-grounded language learning for AI systems by introducing a new task and dataset. The contribution is incremental, however, because it builds on existing textual entailment and VQA methods.
The authors introduce Visual Entailment (VE), a task in which an image, rather than text, serves as the premise, and build the SNLI-VE dataset from SNLI and Flickr30k. They propose the Explainable Visual Entailment (EVE) model and evaluate it against VQA-based models on SNLI-VE, providing insights into grounded language understanding.
We introduce a new inference task, Visual Entailment (VE), which differs from traditional Textual Entailment (TE) tasks in that the premise is defined by an image rather than by a natural language sentence. A novel dataset, SNLI-VE (publicly available at https://github.com/necla-ml/SNLI-VE), is proposed for VE tasks based on the Stanford Natural Language Inference corpus and Flickr30k. We introduce a differentiable architecture called the Explainable Visual Entailment model (EVE) to tackle the VE problem. EVE and several other state-of-the-art visual question answering (VQA) based models are evaluated on the SNLI-VE dataset, facilitating grounded language understanding and providing insights into how modern VQA-based models perform.
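To make the task definition concrete, the following is a minimal sketch of how a VE example could be derived from an SNLI-style record by swapping the text premise for the image it describes. It is not the authors' construction pipeline. It assumes SNLI-style field names (`captionID`, `sentence2`, `gold_label`) and a `captionID` of the form `<image file>#<caption index>`; the `VEExample` class, the `to_visual_entailment` helper, and the sample record are hypothetical and invented for illustration.

```python
from dataclasses import dataclass

# The three inference labels used in SNLI.
LABELS = ("entailment", "neutral", "contradiction")


@dataclass(frozen=True)
class VEExample:
    """Hypothetical container for one Visual Entailment example."""
    image_id: str    # premise: an image, referenced by its file name
    hypothesis: str  # hypothesis: a natural language sentence
    label: str       # one of LABELS


def to_visual_entailment(snli_record: dict) -> VEExample:
    """Replace the text premise of an SNLI-style record with its source image."""
    label = snli_record["gold_label"]
    if label not in LABELS:
        raise ValueError(f"unexpected label: {label!r}")
    # Assumed layout: "<image file>#<caption index>"; keep only the file name.
    image_id = snli_record["captionID"].split("#", 1)[0]
    return VEExample(image_id=image_id,
                     hypothesis=snli_record["sentence2"],
                     label=label)


# Invented record, for illustration only (not taken from the dataset).
record = {
    "captionID": "12345.jpg#0",
    "sentence1": "A man plays a guitar on stage.",
    "sentence2": "A person is performing music.",
    "gold_label": "entailment",
}
example = to_visual_entailment(record)
```

The text premise (`sentence1`) is deliberately dropped: in VE the model must judge the hypothesis against the image alone.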