CV | AI | Aug 6, 2021

Interpretable Visual Understanding with Cognitive Attention Network

arXiv:2108.02924v3 | 19 citations | Has Code
Originality: Incremental advance
AI Analysis

The paper addresses the need for reliable visual scene understanding beyond recognition-level tasks, though the contribution appears incremental because it builds on existing benchmarks and methods.

The paper aims at interpretable visual understanding. It proposes a Cognitive Attention Network (CAN) for visual commonsense reasoning that fuses image and text information and encodes commonsense among image, query, and response, and it reports the approach's effectiveness on the VCR benchmark dataset.

While image understanding at the recognition level has achieved remarkable advances, reliable visual scene understanding requires comprehensive image understanding not only at the recognition level but also at the cognition level, which calls for exploiting multi-source information as well as learning different levels of understanding and extensive commonsense knowledge. In this paper, we propose a novel Cognitive Attention Network (CAN) for visual commonsense reasoning to achieve interpretable visual understanding. Specifically, we first introduce an image-text fusion module to fuse information from images and text collectively. Second, a novel inference module is designed to encode commonsense among image, query, and response. Extensive experiments on the large-scale Visual Commonsense Reasoning (VCR) benchmark dataset demonstrate the effectiveness of our approach. The implementation is publicly available at https://github.com/tanjatang/CAN
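
To make the two ideas in the abstract concrete, here is a minimal sketch of attention-based image-text fusion followed by response scoring. It is not the authors' CAN architecture, which the abstract does not specify. It assumes that text tokens and image regions are already embedded as vectors of the same size, that fusion is plain scaled dot-product attention of text over image regions, and that candidate responses are ranked by similarity to the fused query. All function names (`attend`, `fuse`, `score_responses`) are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def attend(queries, keys):
    """Scaled dot-product attention, using the keys as values as well.

    For each query vector, returns the attention-weighted average of `keys`.
    """
    dim = len(keys[0])
    scale = math.sqrt(dim)
    attended = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys])
        attended.append(
            [sum(w * k[d] for w, k in zip(weights, keys)) for d in range(dim)]
        )
    return attended


def fuse(text_tokens, image_regions):
    """Assumed fusion: concatenate each token with its attended image context."""
    context = attend(text_tokens, image_regions)
    return [token + ctx for token, ctx in zip(text_tokens, context)]


def mean_pool(vectors):
    count = len(vectors)
    return [sum(v[d] for v in vectors) / count for d in range(len(vectors[0]))]


def score_responses(query_tokens, responses, image_regions):
    """Return a probability distribution over the candidate responses."""
    query_vec = mean_pool(fuse(query_tokens, image_regions))
    logits = [
        dot(query_vec, mean_pool(fuse(response, image_regions)))
        for response in responses
    ]
    return softmax(logits)


# Toy example: two image regions, a one-token query, two one-token responses.
image_regions = [[1.0, 0.0], [0.0, 1.0]]
query_tokens = [[1.0, 0.0]]
responses = [[[1.0, 0.0]], [[0.0, 1.0]]]
probabilities = score_responses(query_tokens, responses, image_regions)
```

In this toy setup the first response shares the query's embedding, so it receives the higher probability. A trained model would learn the embeddings and the scoring function rather than use a fixed dot product.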

Code Implementations: 1 repo