CVAIJun 15, 2023

Dissecting Multimodality in VideoQA Transformer Models by Impairing Modality Fusion

arXiv:2306.08889v38 citationsh-index: 38
Originality Incremental advance
AI Analysis

This work reveals limitations in current VideoQA models for learning coupled multimodal representations, which is an incremental but critical insight for the multimodal AI community.

The study investigated whether VideoQA Transformer models genuinely learn multimodal representations or exploit biases, finding that many achieve high benchmark scores without leveraging multimodal structures and perform poorly on a new stress-test dataset requiring high modality coupling.

While VideoQA Transformer models demonstrate competitive performance on standard benchmarks, the reasons behind their success are not fully understood. Do these models capture the rich multimodal structures and dynamics from video and text jointly? Or are they achieving high scores by exploiting biases and spurious features? Hence, to provide insights, we design $\textit{QUAG}$ (QUadrant AveraGe), a lightweight and non-parametric probe, to conduct dataset-model combined representation analysis by impairing modality fusion. We find that the models achieve high performance on many datasets without leveraging multimodal representations. To validate QUAG further, we design $\textit{QUAG-attention}$, a less-expressive replacement of self-attention with restricted token interactions. Models with QUAG-attention achieve similar performance with significantly fewer multiplication operations without any finetuning. Our findings raise doubts about the current models' abilities to learn highly-coupled multimodal representations. Hence, we design the $\textit{CLAVI}$ (Complements in LAnguage and VIdeo) dataset, a stress-test dataset curated by augmenting real-world videos to have high modality coupling. Consistent with the findings of QUAG, we find that most of the models achieve near-trivial performance on CLAVI. This reasserts the limitations of current models for learning highly-coupled multimodal representations, that is not evaluated by the current datasets (project page: https://dissect-videoqa.github.io ).

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes