CV | AI | LG | Mar 10, 2025

Towards Fine-Grained Video Question Answering

arXiv:2503.06820v1 | 1 citation | h-index: 8
Originality: Incremental advance
AI Analysis

This work addresses fine-grained video understanding for AI researchers, but the advance is incremental: it adds new components to existing video-language models.

The paper tackles the problem of limited temporal and spatial granularity in Video Question Answering by introducing the MOMA-QA dataset and the SGVLM model, achieving superior performance and setting new benchmarks on VideoQA tasks.

In the rapidly evolving domain of video understanding, Video Question Answering (VideoQA) remains a focal point. However, existing datasets exhibit gaps in temporal and spatial granularity, which consequently limits the capabilities of existing VideoQA methods. This paper introduces the Multi-Object Multi-Actor Question Answering (MOMA-QA) dataset, which is designed to address these shortcomings by emphasizing temporal localization, spatial relationship reasoning, and entity-centric queries. With ground truth scene graphs and temporal interval annotations, MOMA-QA is ideal for developing models for fine-grained video understanding. Furthermore, we present a novel video-language model, SGVLM, which incorporates a scene graph predictor, an efficient frame retriever, and a pre-trained large language model for temporal localization and fine-grained relationship understanding. Evaluations on MOMA-QA and other public datasets demonstrate the superior performance of our model, setting new benchmarks for VideoQA.

