CV | AI | Jan 9, 2025

Commonsense Video Question Answering through Video-Grounded Entailment Tree Reasoning

arXiv:2501.05069v2 | 10 citations | h-index: 67 | CVPR
AI Analysis

This work addresses biased reasoning in video question answering systems, offering a generalizable method that incrementally improves existing models.

The paper tackles the problem of commonsense video question answering by proposing a video-grounded entailment tree reasoning method to reduce spurious correlations in large visual-language models, achieving improved performance across benchmarks and models as shown in systematic experiments.

This paper proposes the first video-grounded entailment tree reasoning method for commonsense video question answering (VQA). Despite the remarkable progress of large visual-language models (VLMs), there are growing concerns that they learn spurious correlations between videos and likely answers, reinforced by their black-box nature and remaining benchmarking biases. Our method explicitly grounds VQA tasks to video fragments in four steps: entailment tree construction, video-language entailment verification, tree reasoning, and dynamic tree expansion. A vital benefit of the method is its generalizability to current video and image-based VLMs across reasoning types. To support fair evaluation, we devise a de-biasing procedure based on large-language models that rewrites VQA benchmark answer sets to enforce model reasoning. Systematic experiments on existing and de-biased benchmarks highlight the impact of our method components across benchmarks, VLMs, and reasoning types.
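The four steps named in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' implementation: the `Node` class, the `decompose` and `verify` callables (stand-ins for an LLM and a VLM), the min-based aggregation, and the uncertainty band `low`/`high` are all hypothetical choices.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-ins for model calls (not from the paper):
#   Decompose: statement -> list of simpler sub-statements (an LLM in practice)
#   Verify:    statement -> entailment score in [0, 1] against the video (a VLM in practice)
Decompose = Callable[[str], List[str]]
Verify = Callable[[str], float]


@dataclass
class Node:
    statement: str
    children: List["Node"] = field(default_factory=list)


def build_tree(hypothesis: str, decompose: Decompose) -> Node:
    """Step 1 (tree construction): split a candidate answer, phrased as a
    statement, into sub-statements that together should entail it."""
    return Node(hypothesis, [Node(s) for s in decompose(hypothesis)])


def score(node: Node, verify: Verify, decompose: Decompose,
          low: float = 0.4, high: float = 0.6, depth: int = 1) -> float:
    """Steps 2-4 on one tree. Aggregation by min is an assumed reading of
    'all children must hold'; the paper may aggregate differently."""
    if node.children:
        # Step 3 (tree reasoning): a parent holds only if every child holds.
        return min(score(c, verify, decompose, low, high, depth)
                   for c in node.children)
    # Step 2 (entailment verification): check the leaf against the video.
    s = verify(node.statement)
    # Step 4 (dynamic expansion): if the score is uncertain, decompose further.
    if low < s < high and depth > 0:
        subs = decompose(node.statement)
        if subs:
            node.children = [Node(x) for x in subs]
            return min(score(c, verify, decompose, low, high, depth - 1)
                       for c in node.children)
    return s


def answer(candidates: List[str], verify: Verify,
           decompose: Decompose) -> Tuple[str, Dict[str, float]]:
    """Pick the candidate whose entailment tree is best supported."""
    scores = {c: score(build_tree(c, decompose), verify, decompose)
              for c in candidates}
    return max(scores, key=scores.get), scores
```

In practice `decompose` and `verify` would wrap model calls; with dictionary lookups in their place the control flow can be checked by hand, which is the only purpose of this sketch.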

