CV | Nov 26, 2018

Visual Entailment Task for Visually-Grounded Language Learning

arXiv:1811.10582v2 | 59 citations | Has Code
Originality: Incremental advance
AI Analysis

This work addresses the problem of visually-grounded language learning for AI systems by introducing a new task and dataset, though it is incremental as it builds on existing textual entailment and VQA methods.

The authors introduced Visual Entailment (VE), a task where an image serves as the premise instead of text, and created the SNLI-VE dataset from SNLI and Flickr30k. They proposed the Explainable Visual Entailment (EVE) model and evaluated it against VQA-based models on SNLI-VE, providing insights into grounded language understanding.

We introduce a new inference task, Visual Entailment (VE), which differs from traditional Textual Entailment (TE) tasks in that the premise is defined by an image rather than by a natural language sentence. We propose a novel dataset for VE tasks, SNLI-VE (publicly available at https://github.com/necla-ml/SNLI-VE), built from the Stanford Natural Language Inference corpus and Flickr30k. We introduce a differentiable architecture called the Explainable Visual Entailment model (EVE) to tackle the VE problem. EVE and several other state-of-the-art visual question answering (VQA) based models are evaluated on the SNLI-VE dataset, facilitating grounded language understanding and providing insights into how modern VQA-based models perform.
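To make the task definition concrete, the following is a minimal sketch of how a VE example and its evaluation might be represented. It assumes the three-way label set used by SNLI (entailment, neutral, contradiction) and scores predictions by plain accuracy. The class and field names, the `accuracy` helper, the `always_neutral` baseline, and the toy records are all hypothetical illustrations; they are not the SNLI-VE file format and not the authors' code.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Iterable


class Label(Enum):
    """Three-way SNLI label set, assumed here to carry over to VE."""
    ENTAILMENT = "entailment"
    NEUTRAL = "neutral"
    CONTRADICTION = "contradiction"


@dataclass(frozen=True)
class VEExample:
    """One VE instance: an image premise paired with a text hypothesis."""
    image_id: str    # hypothetical field: identifier of the premise image
    hypothesis: str  # natural language hypothesis to test against the image
    label: Label     # gold relation between image and hypothesis


def accuracy(examples: Iterable[VEExample],
             predict: Callable[[str, str], Label]) -> float:
    """Fraction of examples whose predicted label matches the gold label."""
    examples = list(examples)
    if not examples:
        raise ValueError("no examples to score")
    correct = sum(predict(ex.image_id, ex.hypothesis) == ex.label
                  for ex in examples)
    return correct / len(examples)


def always_neutral(image_id: str, hypothesis: str) -> Label:
    """Trivial baseline that ignores both the image and the hypothesis."""
    return Label.NEUTRAL


# Invented toy records for illustration only; not taken from SNLI-VE.
toy = [
    VEExample("img_0001.jpg", "Two people are outdoors.", Label.ENTAILMENT),
    VEExample("img_0001.jpg", "The people are siblings.", Label.NEUTRAL),
    VEExample("img_0001.jpg", "Nobody is in the picture.", Label.CONTRADICTION),
]

if __name__ == "__main__":
    print(f"always-neutral accuracy: {accuracy(toy, always_neutral):.3f}")
```

A real model would replace `always_neutral` with a function that loads the image for `image_id` and reasons over it jointly with the hypothesis. The point of the sketch is only that the premise slot holds an image reference rather than a sentence.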

Code Implementations: 1 repo
