CV · AI · CL · Jan 19, 2024

Weakly Supervised Gaussian Contrastive Grounding with Large Multimodal Models for Video Question Answering

arXiv:2401.10711v4 · 19 citations · MM
Originality: Incremental advance
AI Analysis

This work aims to improve VideoQA accuracy for applications such as video understanding and retrieval. Its contribution is incremental: it builds on existing LMMs and CLIP models, adding a new grounding module.

The paper addresses a weakness in how Large Multimodal Models (LMMs) handle Video Question Answering (VideoQA): they take uniformly sampled frames as input and so ignore question-relevant visual clues. It proposes a weakly supervised framework that feeds question-critical moments to the LMM instead, and reports substantial improvements over previous state-of-the-art methods.

Video Question Answering (VideoQA) aims to answer natural language questions based on the information observed in videos. Despite the recent success of Large Multimodal Models (LMMs) in image-language understanding and reasoning, they handle VideoQA inadequately: they simply take uniformly sampled frames as visual inputs, which ignores question-relevant visual clues. Moreover, existing VideoQA datasets contain no human annotations for question-critical timestamps. In light of this, we propose a novel weakly supervised framework that forces the LMMs to reason out the answers with question-critical moments as visual inputs. Specifically, we first fuse the question and answer pairs into event descriptions and use the visual-language alignment capability of CLIP models to find multiple keyframes, which serve as target moments and pseudo-labels. With these pseudo-labeled keyframes as additional weak supervision, we devise a lightweight Gaussian-based Contrastive Grounding (GCG) module. GCG learns multiple Gaussian functions to characterize the temporal structure of the video, and samples question-critical frames as positive moments to be the visual inputs of LMMs. Extensive experiments on several benchmarks verify the effectiveness of our framework, and we achieve substantial improvements compared to previous state-of-the-art methods.
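The two selection steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`pseudo_label_keyframes`, `gaussian_frame_weights`, `select_frames`) are hypothetical, the feature vectors stand in for CLIP embeddings that would be computed elsewhere, and the Gaussian centers and widths are fixed by hand here, whereas GCG learns them under a contrastive objective that this sketch omits.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def pseudo_label_keyframes(frame_feats, text_feat, k):
    """Return the indices of the k frames most similar to the text feature.

    Assumption: frame_feats and text_feat are embeddings from a CLIP-style
    model, with text_feat encoding the fused question-answer description.
    """
    sims = [cosine(f, text_feat) for f in frame_feats]
    top = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:k]
    return sorted(top)


def gaussian_frame_weights(num_frames, centers, widths):
    """Normalized sum of Gaussians over frame positions scaled to [0, 1].

    Assumption: centers and widths are given; in the paper they are learned.
    """
    weights = []
    for i in range(num_frames):
        pos = i / (num_frames - 1) if num_frames > 1 else 0.0
        weights.append(
            sum(math.exp(-0.5 * ((pos - c) / w) ** 2)
                for c, w in zip(centers, widths))
        )
    total = sum(weights)
    return [w / total for w in weights]


def select_frames(weights, k):
    """Return the indices of the k highest-weighted frames, in temporal order."""
    top = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)[:k]
    return sorted(top)


if __name__ == "__main__":
    # Toy data: 8 one-hot "frame embeddings" and a text embedding that
    # overlaps most with frames 2 and 5.
    frames = [[1.0 if i == j else 0.0 for j in range(8)] for i in range(8)]
    text = [0.0, 0.0, 1.0, 0.0, 0.0, 0.9, 0.0, 0.0]
    print(pseudo_label_keyframes(frames, text, k=2))

    # One Gaussian centered mid-video; pick the 3 highest-weighted frames.
    w = gaussian_frame_weights(11, centers=[0.5], widths=[0.1])
    print(select_frames(w, k=3))
```

In this toy setup the pseudo-labels come from a plain top-k over similarities, and the frame selection is a deterministic top-k over the Gaussian weights. A trained grounding module would adjust the Gaussian parameters so that the selected frames agree with the pseudo-labeled keyframes.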

Code Implementations: 1 repo
