CL · CV · Oct 15, 2020

Natural Language Rationales with Full-Stack Visual Reasoning: From Pixels to Semantic Frames to Commonsense Graphs

arXiv:2010.07526v11011 citations
Originality: Incremental advance
AI Analysis

This work addresses the need for intuitive, human-understandable explanations in AI systems that reason over images and text. It is an incremental advance: it builds on existing methods to improve interpretability.

The authors tackle the challenge of generating natural language rationales for complex visual reasoning tasks: visual commonsense reasoning, visual-textual entailment, and visual question answering. They develop Rationale^VT Transformer, which integrates pretrained language models with object recognition, grounded visual semantic frames, and visual commonsense graphs. Their experiments show that the base pretrained language model benefits from visual adaptation.

Natural language rationales could provide intuitive, higher-level explanations that are easily understandable by humans, complementing the more broadly studied lower-level explanations based on gradients or attention weights. We present the first study focused on generating natural language rationales across several complex visual reasoning tasks: visual commonsense reasoning, visual-textual entailment, and visual question answering. The key challenge of accurate rationalization is comprehensive image understanding at all levels: not just their explicit content at the pixel level, but their contextual contents at the semantic and pragmatic levels. We present Rationale^VT Transformer, an integrated model that learns to generate free-text rationales by combining pretrained language models with object recognition, grounded visual semantic frames, and visual commonsense graphs. Our experiments show that the base pretrained language model benefits from visual adaptation and that free-text rationalization is a promising research direction to complement model interpretability for complex visual-textual reasoning tasks.
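One way to picture how a pretrained language model could be combined with the three visual signals named in the abstract is to turn each signal into text and concatenate it with the task input. The sketch below is illustrative only and is not the authors' implementation: the `VisualContext` class, the `build_rationale_prompt` function, the segment labels, and the example values are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class VisualContext:
    """Textualized visual signals for one image (hypothetical container)."""
    objects: List[str] = field(default_factory=list)      # object recognition labels
    frames: List[str] = field(default_factory=list)       # grounded semantic frames, as text
    commonsense: List[str] = field(default_factory=list)  # commonsense graph inferences, as text


def build_rationale_prompt(ctx: VisualContext, question: str, answer: str) -> str:
    """Concatenate visual context and task text into one language-model input.

    The segment labels and the trailing "because" cue are assumptions made
    for illustration; they are not taken from the paper.
    """
    segments = [
        "objects: " + ", ".join(ctx.objects),
        "frames: " + "; ".join(ctx.frames),
        "commonsense: " + "; ".join(ctx.commonsense),
        "question: " + question,
        "answer: " + answer,
        "because",  # cue for the model to continue with a free-text rationale
    ]
    return " | ".join(segments)


# Made-up example values, not drawn from any dataset.
ctx = VisualContext(
    objects=["person", "umbrella"],
    frames=["holding(agent=person, item=umbrella)"],
    commonsense=["person wants to stay dry"],
)
prompt = build_rationale_prompt(
    ctx, "Why is the person holding an umbrella?", "It is raining."
)
print(prompt)
```

The resulting string would then be passed to a pretrained language model, which continues it with a rationale. Feeding visual signals in as text is only one option; it keeps the language model's input format unchanged, at the cost of discarding anything the upstream recognizers fail to verbalize.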

Code Implementations: 1 repo
