CV · AI · Dec 7, 2024

Black Swan: Abductive and Defeasible Video Reasoning in Unpredictable Events

arXiv:2412.05725v2 · 11 citations · h-index: 27 · Has Code · CVPR
Originality: Synthesis-oriented
AI Analysis

This work addresses the need for better evaluation of VLMs' reasoning abilities in atypical scenarios, which is crucial for advancing AI robustness. It is incremental, however, in that it contributes a benchmark rather than new models.

The paper tackles the problem of evaluating commonsense reasoning in vision-language models (VLMs) for unpredictable events by introducing BlackSwanSuite, a benchmark with over 15,000 questions across 1,655 videos, and finds that state-of-the-art VLMs like GPT-4o and Gemini 1.5 Pro show performance gaps of up to 32% compared to humans.

The commonsense reasoning capabilities of vision-language models (VLMs), especially in abductive reasoning and defeasible reasoning, remain poorly understood. Most benchmarks focus on typical visual scenarios, making it difficult to discern whether model performance stems from keen perception and reasoning skills, or reliance on pure statistical recall. We argue that by focusing on atypical events in videos, clearer insights can be gained on the core capabilities of VLMs. Explaining and understanding such out-of-distribution events requires models to extend beyond basic pattern recognition and regurgitation of their prior knowledge. To this end, we introduce BlackSwanSuite, a benchmark for evaluating VLMs' ability to reason about unexpected events through abductive and defeasible tasks. Our tasks artificially limit the amount of visual information provided to models while questioning them about hidden unexpected events, or provide new visual information that could change an existing hypothesis about the event. We curate a comprehensive benchmark suite comprising over 3,800 MCQ, 4,900 generative and 6,700 yes/no questions, spanning 1,655 videos. After extensively evaluating various state-of-the-art VLMs, including GPT-4o and Gemini 1.5 Pro, as well as open-source VLMs such as LLaVA-Video, we find significant performance gaps of up to 32% from humans on these tasks. Our findings reveal key limitations in current VLMs, emphasizing the need for enhanced model architectures and training strategies. Our data and leaderboard are available at blackswan.cs.ubc.ca.

