CV · Mar 18, 2023

RCA: Region Conditioned Adaptation for Visual Abductive Reasoning

arXiv:2303.10428v6 · 4 citations · h-index: 51 · Has Code
Originality: Incremental advance
AI Analysis

This work addresses how AI systems can generate plausible explanations for visual observations. It is an incremental improvement in visual reasoning.

The paper tackles visual abductive reasoning by proposing a hybrid parameter-efficient fine-tuning method that enables a frozen CLIP model to infer explanations from local visual cues. It reports state-of-the-art results on the Sherlock benchmark, with a Human Acc score of 31.74 versus 29.58 for CPT-CLIP.

Visual abductive reasoning aims to make likely explanations for visual observations. We propose a simple yet effective Region Conditioned Adaptation (RCA), a hybrid parameter-efficient fine-tuning method that equips the frozen CLIP with the ability to infer explanations from local visual cues. We encode "local hints" and "global contexts" into visual prompts of the CLIP model separately, at fine- and coarse-grained levels. Adapters are commonly used to fine-tune CLIP models for downstream tasks, and we design a new attention adapter that directly steers the focus of the attention map with trainable query and key projections of a frozen CLIP model. Finally, we train our new model with a modified contrastive loss that regresses the visual feature simultaneously toward the features of the literal description and of the plausible explanation. The loss enables CLIP to maintain both perception and reasoning abilities. Experiments on the Sherlock visual abductive reasoning benchmark show that RCA significantly outperforms previous state-of-the-art methods, ranking first on the leaderboards (e.g., Human Acc: RCA 31.74 vs. CPT-CLIP 29.58; higher is better). We also validate that RCA generalizes to local perception benchmarks such as RefCOCO. We open-source our project at https://github.com/LUNAProject22/RPA.
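The abstract does not give the exact form of the modified contrastive loss, so the following is only a minimal sketch of the general idea: one visual feature is pulled toward two text targets at once. It assumes a symmetric InfoNCE term for each target and an unweighted sum of the two. The function names, the temperature value, and the equal weighting are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def info_nce(image_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE loss; row i of both matrices is a matched pair."""
    img = l2_normalize(image_feats)
    txt = l2_normalize(text_feats)
    logits = img @ txt.T / temperature  # (batch, batch) cosine similarities
    idx = np.arange(logits.shape[0])
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> image
    return 0.5 * (loss_i2t + loss_t2i)


def dual_contrastive_loss(region_feats, clue_feats, inference_feats,
                          temperature=0.07):
    """Assumed form: pull the visual feature toward BOTH text targets.

    clue_feats      -- features of the literal description (perception)
    inference_feats -- features of the plausible explanation (reasoning)
    The unweighted sum is an assumption made for illustration.
    """
    return (info_nce(region_feats, clue_feats, temperature)
            + info_nce(region_feats, inference_feats, temperature))
```

In this sketch, a batch in which every region feature matches both of its text targets yields a loss near zero, while mismatching either target raises the loss. That is one way a single objective can keep both the perception and the reasoning signal active.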

Code Implementations: 1 repo
