CV | Oct 27, 2025

A Video Is Not Worth a Thousand Words

arXiv:2510.23253v1 | 1 citation | h-index: 1 | Has Code

Originality: Incremental advance

AI Analysis

The paper addresses concerns about text bias in multi-modal AI that matter to researchers and developers, but its contribution is incremental: it provides new metrics rather than a breakthrough solution.

The paper tackles the problem of measuring text dominance and modality interactions in vision-language models for video question answering. It finds that models rely heavily on text and that the multiple-choice task reduces to ignoring distractors.

As we become increasingly dependent on vision language models (VLMs) to answer questions about the world around us, a significant amount of research is devoted to increasing both the difficulty of video question answering (VQA) datasets and the context lengths of the models that they evaluate. The reliance on large language models as backbones has led to concerns about potential text dominance, and the exploration of interactions between modalities is underdeveloped. Given the complexity that multi-modal models introduce, how do we measure whether we are heading in the right direction? We propose a joint method of computing both feature attributions and modality scores based on Shapley values, where both the features and modalities are arbitrarily definable. Using these metrics, we compare six VLMs of varying context lengths on four representative datasets, focusing on multiple-choice VQA. In particular, we consider video frames and whole textual elements as equal features in the hierarchy, and the multiple-choice VQA task as an interaction between three modalities: video, question and answer. Our results demonstrate a dependence on text and show that the multiple-choice VQA task devolves into a model's ability to ignore distractors. Code available at https://github.com/sjpollard/a-video-is-not-worth-a-thousand-words.
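To make the idea of Shapley-based feature attributions and modality scores concrete, here is a minimal sketch in Python. It is not the authors' implementation (see the linked repository for that). It computes exact Shapley values by enumerating every subset, which is only feasible for a handful of features. The feature names, the toy `value_fn`, and the choice to define a modality score as the normalised sum of absolute attributions are all assumptions made for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(features, value_fn):
    """Exact Shapley values by enumerating all coalitions (small inputs only)."""
    n = len(features)
    phi = {f: 0.0 for f in features}
    for f in features:
        others = [g for g in features if g != f]
        for k in range(n):
            # Weight of a coalition of size k that does not contain f.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                phi[f] += weight * (value_fn(s | {f}) - value_fn(s))
    return phi


def modality_scores(phi, modality_of):
    """Assumed definition: share of total absolute attribution per modality."""
    totals = {}
    for f, v in phi.items():
        m = modality_of[f]
        totals[m] = totals.get(m, 0.0) + abs(v)
    norm = sum(totals.values())
    return {m: (t / norm if norm else 0.0) for m, t in totals.items()}


# Hypothetical features: two video frames and three whole text elements.
FEATURES = ["frame_0", "frame_1", "question", "answer_correct", "answer_distractor"]
MODALITY_OF = {
    "frame_0": "video",
    "frame_1": "video",
    "question": "question",
    "answer_correct": "answer",
    "answer_distractor": "answer",
}


def toy_value_fn(present):
    """Stand-in for a model's score on the correct answer given visible features."""
    score = 0.0
    score += 0.1 * ("frame_0" in present)
    score += 0.1 * ("frame_1" in present)
    score += 0.3 * ("question" in present)
    score += 0.5 * ("answer_correct" in present)
    score -= 0.2 * ("answer_distractor" in present)  # distractor hurts the score
    # Interaction: the question only fully helps when frame_0 is also visible.
    score += 0.2 * ("frame_0" in present and "question" in present)
    return score


phi = shapley_values(FEATURES, toy_value_fn)
scores = modality_scores(phi, MODALITY_OF)
```

In this toy setup the interaction term is split equally between `frame_0` and `question`, and the attributions sum to the difference between the score with all features and the score with none. In practice, `value_fn` would wrap a VLM forward pass with masked frames or text, and the exponential enumeration would be replaced by a sampling approximation.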
