CV, CL | Nov 25, 2025

CounterVQA: Evaluating and Improving Counterfactual Reasoning in Vision-Language Models for Video Understanding

arXiv:2511.19923v1 | 2 citations | Has Code
Originality: Incremental advance
AI Analysis

This work addresses the need for robust video understanding in AI systems by focusing on counterfactual reasoning, an essential but underexplored capability. The advance is incremental because it builds on existing models and benchmarks.

The paper evaluates and improves counterfactual reasoning in vision-language models for video understanding. It introduces the CounterVQA benchmark and shows that state-of-the-art models perform reasonably on simple questions but lose significant accuracy on complex multi-hop causal chains. The proposed CFGPT method yields consistent improvements across all difficulty levels.

Vision-Language Models (VLMs) have recently shown significant advancements in video understanding, especially in feature alignment, event reasoning, and instruction-following tasks. However, their capability for counterfactual reasoning, that is, inferring alternative outcomes under hypothetical conditions, remains underexplored. This capability is essential for robust video understanding, as it requires identifying underlying causal structures and reasoning about unobserved possibilities, rather than merely recognizing observed patterns. To systematically evaluate this capability, we introduce CounterVQA, a video-based benchmark featuring three progressive difficulty levels that assess different aspects of counterfactual reasoning. Through comprehensive evaluation of both state-of-the-art open-source and closed-source models, we uncover a substantial performance gap: while these models achieve reasonable accuracy on simple counterfactual questions, performance degrades significantly on complex multi-hop causal chains. To address these limitations, we develop a post-training method, CFGPT, that enhances a model's visual counterfactual reasoning ability by distilling its counterfactual reasoning capability from the language modality, yielding consistent improvements across all CounterVQA difficulty levels. The dataset and code will be released.

