CL, Jun 10, 2019

Multimodal Logical Inference System for Visual-Textual Entailment

arXiv:1906.03952v11092 citations
Originality: Incremental advance
AI Analysis

This paper addresses visual-textual entailment for AI systems, but the contribution appears incremental because it builds on existing semantic parsing and theorem proving methods.

The paper tackles the problem of multimodal inference across text and vision by using logic-based representations and an unsupervised system to prove entailment relations, showing it can handle semantically complex sentences.

A large amount of research about multimodal inference across text and vision has been recently developed to obtain visually grounded word and sentence representations. In this paper, we use logic-based representations as unified meaning representations for texts and images and present an unsupervised multimodal logical inference system that can effectively prove entailment relations between them. We show that by combining semantic parsing and theorem proving, the system can handle semantically complex sentences for visual-textual inference.
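
To make the idea of a unified logical representation concrete, here is a minimal sketch, not the paper's system. It assumes an image has already been converted into a small set of facts (entities, attributes, relations) and that a sentence has already been parsed into an existentially quantified conjunction of atoms. The fact format, the `holds` and `exists` helpers, and the example scene are all hypothetical. Instead of a theorem prover, the sketch uses brute-force model checking over the image's entities, which is a much simpler stand-in.

```python
from itertools import product

# Hypothetical image annotation expressed as a first-order structure:
# a set of entities, unary attribute facts, and binary relation facts.
image = {
    "entities": {"e1", "e2"},
    "unary": {("cat", "e1"), ("white", "e1"), ("sofa", "e2")},
    "binary": {("on", "e1", "e2")},
}

def holds(atom, assignment, model):
    """Check one atom, e.g. ("cat", "x") or ("on", "x", "y"), under an assignment."""
    pred, *args = atom
    grounded = (pred, *(assignment[a] for a in args))
    facts = model["unary"] if len(args) == 1 else model["binary"]
    return grounded in facts

def exists(variables, atoms, model):
    """Return True if some assignment of entities to variables satisfies all atoms."""
    entities = sorted(model["entities"])
    for values in product(entities, repeat=len(variables)):
        assignment = dict(zip(variables, values))
        if all(holds(atom, assignment, model) for atom in atoms):
            return True
    return False

# "A white cat is on a sofa":
# exists x, y. cat(x) & white(x) & sofa(y) & on(x, y)
hypothesis = [("cat", "x"), ("white", "x"), ("sofa", "y"), ("on", "x", "y")]
entailed = exists(["x", "y"], hypothesis, image)

# "There is a black cat": exists x. cat(x) & black(x)
not_entailed = exists(["x"], [("cat", "x"), ("black", "x")], image)
```

Under these assumptions, the first sentence is judged true of the image and the second is not. A full system along the lines the abstract describes would derive the formulas automatically with a semantic parser and hand them to a theorem prover, which can also handle negation, universal quantification, and background axioms that this sketch omits.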

