CL · Jun 10, 2019

Multimodal Logical Inference System for Visual-Textual Entailment

Riko Suzuki, Hitomi Yanaka, Masashi Yoshikawa, Koji Mineshima, Daisuke Bekki

arXiv:1906.03952v1

Originality: Incremental advance

AI Analysis

The paper addresses the challenge of visual-textual entailment for AI systems, but it appears incremental, since it builds on existing semantic parsing and theorem proving methods.

The paper tackles multimodal inference across text and vision, using logic-based representations and an unsupervised system that proves entailment relations, and shows that the system can handle semantically complex sentences.

A large amount of research about multimodal inference across text and vision has been recently developed to obtain visually grounded word and sentence representations. In this paper, we use logic-based representations as unified meaning representations for texts and images and present an unsupervised multimodal logical inference system that can effectively prove entailment relations between them. We show that by combining semantic parsing and theorem proving, the system can handle semantically complex sentences for visual-textual inference.
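The abstract stays at a high level. As a rough illustration of why a shared logical representation makes visual-textual entailment checkable, the Python sketch below encodes an image as a finite first-order model and tests a sentence's logical form against it. This is a toy under stated assumptions: the entity IDs, predicate names and tuple encoding are hypothetical, the logical forms are written by hand rather than produced by semantic parsing, and it substitutes model checking over a closed-world model for the theorem proving the abstract describes.

```python
# Toy visual-textual entailment by model checking. This is an illustrative
# assumption, not the authors' semantic-parsing + theorem-proving pipeline.

# Hypothetical image encoded as a finite first-order model: a domain of
# entity IDs and the extension (set of argument tuples) of each predicate.
IMAGE = {
    "domain": {"e1", "e2", "e3"},
    "preds": {
        "man": {("e1",)},
        "hat": {("e2",)},
        "dog": {("e3",)},
        "wear": {("e1", "e2")},
        "sit": {("e3",)},
    },
}


def holds(formula, model, env=None):
    """Return True if the first-order formula is true in the finite model.

    Formulas are nested tuples: ("pred", name, *vars), ("not", f),
    ("and", f, ...), ("or", f, ...), ("implies", f, g),
    ("exists", var, f) and ("forall", var, f).
    """
    env = env or {}
    op = formula[0]
    if op == "pred":
        _, name, *args = formula
        # Closed world: a predicate missing from the image is false everywhere.
        return tuple(env[a] for a in args) in model["preds"].get(name, set())
    if op == "not":
        return not holds(formula[1], model, env)
    if op == "and":
        return all(holds(f, model, env) for f in formula[1:])
    if op == "or":
        return any(holds(f, model, env) for f in formula[1:])
    if op == "implies":
        return not holds(formula[1], model, env) or holds(formula[2], model, env)
    if op in ("exists", "forall"):
        _, var, body = formula
        # Try every entity in the domain as the value of the bound variable.
        results = (holds(body, model, {**env, var: d}) for d in model["domain"])
        return any(results) if op == "exists" else all(results)
    raise ValueError(f"unknown operator: {op!r}")


# Hand-written logical forms standing in for semantic-parser output.
A_MAN_WEARS_A_HAT = (
    "exists", "x",
    ("and",
     ("pred", "man", "x"),
     ("exists", "y", ("and", ("pred", "hat", "y"), ("pred", "wear", "x", "y")))),
)
NO_DOG_IS_RUNNING = ("not", ("exists", "x", ("and", ("pred", "dog", "x"), ("pred", "run", "x"))))
EVERY_MAN_WEARS_A_HAT = (
    "forall", "x",
    ("implies",
     ("pred", "man", "x"),
     ("exists", "y", ("and", ("pred", "hat", "y"), ("pred", "wear", "x", "y")))),
)
A_DOG_WEARS_A_HAT = (
    "exists", "x",
    ("and",
     ("pred", "dog", "x"),
     ("exists", "y", ("and", ("pred", "hat", "y"), ("pred", "wear", "x", "y")))),
)

if __name__ == "__main__":
    for name, f in [("A man is wearing a hat", A_MAN_WEARS_A_HAT),
                    ("No dog is running", NO_DOG_IS_RUNNING),
                    ("Every man is wearing a hat", EVERY_MAN_WEARS_A_HAT),
                    ("A dog is wearing a hat", A_DOG_WEARS_A_HAT)]:
        print(f"{name}: {holds(f, IMAGE)}")
```

Because the sketch treats the image as a closed world, anything not listed is false, which is why "No dog is running" comes out true. A prover working from image-derived axioms would need that assumption stated explicitly; the sketch only shows how quantifiers and negation become checkable once text and image share one logical representation.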
