CV · LG · Jul 2, 2024

The Solution for the ICCV 2023 Perception Test Challenge 2023 -- Task 6 -- Grounded videoQA

arXiv:2407.01907v1 · h-index: 2
Originality: Synthesis-oriented
AI Analysis

This work addresses a specific problem in video understanding for computer vision researchers, but it appears incremental as it builds on existing methods without broad SOTA claims.

The paper tackles grounded video question answering. It identifies two problems in the baseline method's visual grounding step: the selected frames may not show the target object clearly, and a single image cannot resolve queries that depend on temporal context. The authors propose a two-stage approach that uses VALOR for question answering and TubeDETR for bounding box generation, and they report results on the ICCV 2023 Perception Test Challenge.

In this paper, we introduce a grounded video question-answering solution. Our research reveals that the fixed official baseline method for video question answering involves two main steps: visual grounding and object tracking. However, a significant challenge emerges during the initial step, where selected frames may lack clearly identifiable target objects. Furthermore, single images cannot address questions like "Track the container from which the person pours the first time." To tackle these issues, we propose an alternative two-stage approach: (1) First, we leverage the VALOR model to answer questions based on video information. (2) Then, we concatenate the questions with their respective answers. Finally, we employ TubeDETR to generate bounding boxes for the targets.
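The data flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `answer_fn` and `ground_fn` are hypothetical stand-ins for a VALOR-style question-answering model and a TubeDETR-style grounding model, and neither reflects the real API of those projects. The space-separated concatenation and the `(x1, y1, x2, y2)` box format are also assumptions, since the abstract does not specify them.

```python
from typing import Any, Callable, Dict, Tuple

# Assumed box format (x1, y1, x2, y2); the abstract does not specify one.
Box = Tuple[float, float, float, float]


def build_grounding_query(question: str, answer: str) -> str:
    """Concatenate a question with its answer into one text query.

    The single-space separator is an assumption made for illustration.
    """
    return f"{question.strip()} {answer.strip()}"


def ground_video_qa(
    video: Any,
    question: str,
    answer_fn: Callable[[Any, str], str],
    ground_fn: Callable[[Any, str], Dict[int, Box]],
) -> Dict[int, Box]:
    """Two-stage flow: answer the question, then ground the combined query.

    answer_fn: hypothetical wrapper around a video QA model (e.g. VALOR).
    ground_fn: hypothetical wrapper around a spatio-temporal grounding
               model (e.g. TubeDETR), returning one box per frame index.
    """
    answer = answer_fn(video, question)              # stage 1: video QA
    query = build_grounding_query(question, answer)  # question + answer
    return ground_fn(video, query)                   # stage 2: box tube


# Stub models so the sketch runs without any real checkpoints.
def stub_answer(video: Any, question: str) -> str:
    return "the glass bottle"


def stub_ground(video: Any, query: str) -> Dict[int, Box]:
    return {0: (0.1, 0.2, 0.4, 0.6)}


boxes = ground_video_qa(
    video=None,
    question="Track the container from which the person pours the first time.",
    answer_fn=stub_answer,
    ground_fn=stub_ground,
)
```

Passing the models in as callables keeps the sketch independent of any particular checkpoint or library; in practice each wrapper would handle its own frame sampling and preprocessing.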

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
