CL · CV · Jun 27, 2024

Selective Vision is the Challenge for Visual Reasoning: A Benchmark for Visual Argument Understanding

arXiv:2406.18925v3 · 25 citations
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of visual reasoning for AI systems in domains such as advertising and social causes. It presents a new benchmark, though the dataset creation itself is an incremental contribution.

The paper asks whether AI systems can understand visual arguments, which require selective vision: identifying the visual stimuli within an image that are relevant to the argument. It introduces the VisArgs dataset of 1,611 images and three evaluation tasks. Results show that machines struggle: GPT-4-O achieves 78.5% accuracy compared to 98.0% for humans, and models perform 19.5% worse when the distractors are irrelevant objects within the image rather than external objects.

Visual arguments, often used in advertising or social causes, rely on images to persuade viewers to do or believe something. Understanding these arguments requires selective vision: only specific visual stimuli within an image are relevant to the argument, and relevance can only be understood within the context of a broader argumentative structure. While visual arguments are readily appreciated by human audiences, we ask: are today's AI capable of similar understanding? We present VisArgs, a dataset of 1,611 images annotated with 5,112 visual premises (with regions), 5,574 commonsense premises, and reasoning trees connecting them into structured arguments. We propose three tasks for evaluating visual argument understanding: premise localization, premise identification, and conclusion deduction. Experiments show that 1) machines struggle to capture visual cues: GPT-4-O achieved 78.5% accuracy, while humans reached 98.0%. Models also performed 19.5% worse when distinguishing between irrelevant objects within the image compared to external objects. 2) Providing relevant visual premises improved model performance significantly.
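
The annotation structure described above (visual premises with regions, commonsense premises, and a reasoning tree connecting them to a conclusion) can be pictured with a minimal Python sketch. This is not the VisArgs release format: every class name, field name, the box convention, and the example advertisement below are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union


@dataclass
class VisualPremise:
    text: str
    # Hypothetical region format: (x, y, width, height) in pixels.
    region: Tuple[int, int, int, int]


@dataclass
class CommonsensePremise:
    text: str


@dataclass
class ReasoningNode:
    """An intermediate or final conclusion supported by premises or sub-conclusions."""
    claim: str
    supports: List[Union["ReasoningNode", VisualPremise, CommonsensePremise]] = field(
        default_factory=list
    )


def visual_premises(node: ReasoningNode) -> List[VisualPremise]:
    """Collect every visual premise in the tree rooted at `node`, depth-first."""
    found: List[VisualPremise] = []
    for child in node.supports:
        if isinstance(child, VisualPremise):
            found.append(child)
        elif isinstance(child, ReasoningNode):
            found.extend(visual_premises(child))
    return found


# Invented example (not taken from the dataset): an anti-smoking poster.
example = ReasoningNode(
    claim="Smoking shortens your life.",
    supports=[
        ReasoningNode(
            claim="The cigarette is depicted as a burning fuse.",
            supports=[
                VisualPremise("A cigarette is lit at one end.", (40, 60, 200, 30)),
                VisualPremise("The cigarette leads to a clock.", (240, 50, 80, 80)),
            ],
        ),
        CommonsensePremise("A burning fuse signals that time is running out."),
    ],
)
```

Under these assumptions, premise localization would correspond to recovering each `region`, premise identification to picking out which objects appear as `VisualPremise` leaves, and conclusion deduction to producing the root `claim`.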

Code Implementations: 1 repo