CV | AI | Jun 11, 2025

TOGA: Temporally Grounded Open-Ended Video QA with Weak Supervision

Amazon
arXiv:2506.09445v1 | 5 citations | h-index: 6
Originality: Incremental advance
AI Analysis

This work addresses video QA with temporal grounding, which is relevant to AI and multimedia applications. It is rated an incremental advance because it builds on existing vision-language models and weak supervision techniques.

The authors tackle video question answering with temporal grounding under weak supervision, that is, without temporal annotations. They propose TOGA, a model that jointly generates answers and their temporal grounding, and report state-of-the-art performance on the NExT-GQA, MSVD-QA, and ActivityNet-QA benchmarks.

We address the problem of video question answering (video QA) with temporal grounding in a weakly supervised setup, without any temporal annotations. Given a video and a question, we generate an open-ended answer grounded with the start and end time. For this task, we propose TOGA: a vision-language model for Temporally Grounded Open-Ended Video QA with Weak Supervision. We instruct-tune TOGA to jointly generate the answer and the temporal grounding. We operate in a weakly supervised setup where the temporal grounding annotations are not available. We generate pseudo labels for temporal grounding and ensure the validity of these labels by imposing a consistency constraint between the question of a grounding response and the response generated by a question referring to the same temporal segment. We notice that jointly generating the answers with the grounding improves performance on question answering as well as grounding. We evaluate TOGA on grounded QA and open-ended QA tasks. For grounded QA, we consider the NExT-GQA benchmark which is designed to evaluate weakly supervised grounded question answering. For open-ended QA, we consider the MSVD-QA and ActivityNet-QA benchmarks. We achieve state-of-the-art performance for both tasks on these benchmarks.
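The consistency constraint on pseudo labels described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the `PseudoLabel` record, the `make_referring_question` template, the `answer_fn` callable standing in for the model, and the normalized exact-match comparison are all assumptions made for illustration. The idea shown is only that a pseudo label is kept when a question referring to the same temporal segment yields a matching answer.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical representation of a temporal segment: (start_sec, end_sec).
Segment = Tuple[float, float]


@dataclass
class PseudoLabel:
    """Hypothetical record: a question, its answer, and a pseudo-labelled segment."""
    question: str
    answer: str
    segment: Segment


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(text.lower().split())


def filter_consistent(
    labels: List[PseudoLabel],
    make_referring_question: Callable[[str, Segment], str],
    answer_fn: Callable[[str], str],
) -> List[PseudoLabel]:
    """Keep labels whose answer is reproduced by a segment-referring question.

    Assumption: consistency is checked by normalized exact match; a real
    system would likely use a softer similarity measure.
    """
    kept = []
    for label in labels:
        referring_q = make_referring_question(label.question, label.segment)
        if normalize(answer_fn(referring_q)) == normalize(label.answer):
            kept.append(label)
    return kept


# Usage with a stub in place of a vision-language model (assumed data).
def make_referring_question(question: str, segment: Segment) -> str:
    start, end = segment
    return f"{question} (between {start:.1f}s and {end:.1f}s)"


labels = [
    PseudoLabel("what is the dog doing", "Running", (2.0, 5.0)),
    PseudoLabel("what is the man holding", "a cup", (7.0, 9.0)),
]
canned_answers = {
    make_referring_question(labels[0].question, labels[0].segment): "running",
    make_referring_question(labels[1].question, labels[1].segment): "a phone",
}
kept = filter_consistent(labels, make_referring_question, canned_answers.__getitem__)
```

In this toy run the second label is dropped because the stub's answer for its segment disagrees with the pseudo-labelled answer. Passing the model as a plain callable keeps the filter independent of any particular framework.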

