CL, CV, LG | Aug 14, 2019

Fusion of Detected Objects in Text for Visual Question Answering

arXiv:1908.05054v2 | 1094 citations | Has Code
AI Analysis

This paper addresses the challenge of integrating vision and language in AI systems. It represents an incremental improvement in a specific domain.

The paper tackles the problem of multimodal context modeling by introducing the B2T2 architecture for visual question answering, achieving a new state-of-the-art on the Visual Commonsense Reasoning benchmark with a 25% relative reduction in error rate.
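To make the "relative reduction in error rate" figure concrete, the following is a minimal sketch of how such a number is computed from two accuracies. The function name and the accuracy values are hypothetical and chosen only for illustration; they are not the paper's reported results.

```python
def relative_error_reduction(baseline_accuracy: float, new_accuracy: float) -> float:
    """Return the fraction of the baseline's errors that the new model removes.

    Both arguments are accuracies in [0, 1]; error rate is 1 - accuracy.
    """
    baseline_error = 1.0 - baseline_accuracy
    new_error = 1.0 - new_accuracy
    if baseline_error <= 0.0:
        raise ValueError("baseline has no errors, so a relative reduction is undefined")
    return (baseline_error - new_error) / baseline_error


# Illustrative numbers only (NOT taken from the paper):
# a baseline at 60% accuracy has a 40% error rate, a new model at 70% has 30%.
reduction = relative_error_reduction(0.60, 0.70)
print(f"relative error reduction: {reduction:.0%}")
```

With these illustrative inputs the absolute gain is 10 accuracy points, while the relative error reduction is about one quarter, which is why the two ways of reporting progress should not be confused.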

To advance models of multimodal context, we introduce a simple yet powerful neural architecture for data that combines vision and natural language. The "Bounding Boxes in Text Transformer" (B2T2) also leverages referential information binding words to portions of the image in a single unified architecture. B2T2 is highly effective on the Visual Commonsense Reasoning benchmark (https://visualcommonsense.com), achieving a new state-of-the-art with a 25% relative reduction in error rate compared to published baselines and obtaining the best performance to date on the public leaderboard (as of May 22, 2019). A detailed ablation analysis shows that the early integration of the visual features into the text analysis is key to the effectiveness of the new architecture. A reference implementation of our models is provided (https://github.com/google-research/language/tree/master/language/question_answering/b2t2).
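The abstract's idea of early integration, in which words that refer to an image region carry that region's visual features into the text model, can be sketched as follows. This is an illustration under stated assumptions, not the reference implementation: the function names, the additive fusion, the linear projection, and the toy dimensions are all hypothetical, and a real model would feed the fused embeddings into a Transformer encoder.

```python
from typing import List, Optional

Vector = List[float]


def project(features: Vector, weights: List[Vector]) -> Vector:
    """Linearly map a visual feature vector into token-embedding space.

    `weights` has one row per output dimension (hypothetical learned matrix).
    """
    return [sum(w * f for w, f in zip(row, features)) for row in weights]


def fuse_early(
    token_embeddings: List[Vector],
    box_features: List[Vector],
    token_to_box: List[Optional[int]],
    weights: List[Vector],
) -> List[Vector]:
    """Add projected bounding-box features to the tokens that refer to them.

    Tokens with no associated box (None) keep their text-only embedding.
    Additive fusion is an assumption made for this sketch.
    """
    fused = []
    for embedding, box_index in zip(token_embeddings, token_to_box):
        if box_index is None:
            fused.append(list(embedding))
        else:
            visual = project(box_features[box_index], weights)
            fused.append([e + v for e, v in zip(embedding, visual)])
    return fused


# Toy example: four tokens, one detected box bound to the third token.
tokens = ["why", "is", "person1", "smiling"]
token_embeddings = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
box_features = [[1.0, 0.0, 2.0]]             # one box, 3-d visual features
token_to_box = [None, None, 0, None]         # "person1" refers to box 0
weights = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.0]]  # 3-d visual -> 2-d embedding

fused = fuse_early(token_embeddings, box_features, token_to_box, weights)
for token, vector in zip(tokens, fused):
    print(token, vector)
```

The point of the sketch is the ordering: visual information enters at the input layer, before any text analysis, rather than being combined with a finished text representation at the end.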

Code Implementations: 1 repo