CV · Oct 5, 2025

Video-in-the-Loop: Span-Grounded Long Video QA with Interleaved Reasoning

arXiv:2510.04022v3 · 2 citations · h-index: 10
Originality: Incremental advance
AI Analysis

The paper addresses the challenge of scalable and interpretable long-video QA for AI systems, though it appears incremental, since it builds on existing methods with token budget constraints.

The paper tackles long-video question answering with a two-stage framework that first localizes the relevant intervals and then reallocates visual tokens to them for efficient processing, achieving up to 8.6% improvement with 50% less frame input on benchmarks such as Charades-STA and ActivityNet-Captions.
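
The budget reallocation described above can be illustrated with a minimal sketch. It assumes the budget is counted in frames rather than visual tokens and that frames are split across spans in proportion to span duration; the function names and the proportional rule are illustrative assumptions, not the paper's implementation.

```python
def uniform_timestamps(start, end, n):
    """Return n timestamps at the midpoints of n equal bins in [start, end]."""
    if n <= 0:
        return []
    step = (end - start) / n
    return [start + (i + 0.5) * step for i in range(n)]


def span_aware_timestamps(spans, budget):
    """Spend a fixed frame budget only inside the localized spans.

    Assumption (not from the source): frames are divided across spans in
    proportion to duration, and every span has positive duration.
    """
    total = sum(end - start for start, end in spans)
    stamps = []
    remaining = budget
    for idx, (start, end) in enumerate(spans):
        if idx == len(spans) - 1:
            n = remaining  # last span takes whatever is left of the budget
        else:
            n = min(round(budget * (end - start) / total), remaining)
        remaining -= n
        stamps.extend(uniform_timestamps(start, end, n))
    return stamps


# Hypothetical numbers: a 600 s video and a budget of 32 frames.
skim = uniform_timestamps(0.0, 600.0, 32)          # low-fps pass over the whole video
zoom = span_aware_timestamps([(120.0, 180.0)], 32)  # same budget, one 60 s span
skim_fps = len(skim) / 600.0
zoom_fps = len(zoom) / 60.0
```

With these hypothetical numbers the frame count is identical in both passes, but the effective frame rate inside the span is ten times that of the skim, because the span is one tenth of the video's length.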

We present Video-in-the-Loop (ViTL), a two-stage long-video QA framework that preserves a fixed token budget by first localizing the question-relevant interval(s) with a low-fps skim and then answering via span-aware reallocation of visual tokens at a higher effective frame rate, emitting an interleaved output that contains both the spans and the final option for direct attribution. We also introduce \dataname{}, which converts description-based event graphs into span-grounded multiple-choice QA by pairing each question with ground-truth time span(s) and related reasoning. ViTL is trained end-to-end with an interleaved group-relative objective that couples temporal IoU for localization with answer correctness, allowing credit to flow from answers back to spans without increasing compute. Under fixed token budgets, ViTL attains up to 8.6% improvement with 50% less frame input on long-video QA and temporal grounding (e.g., Charades-STA, ActivityNet-Captions), and ablations show that span-aware token reallocation consistently surpasses uniform sampling. Together, \dataname{} and ViTL provide an interpretable, compute-efficient recipe for scalable long-video QA.
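
To make the training signal concrete, here is a hedged sketch of a reward that couples temporal IoU with answer correctness and is then normalized within a group of sampled outputs. The abstract does not give the exact formula, so the weighting `w_iou`, the 0/1 correctness term, and the mean/std normalization are all assumptions chosen for illustration.

```python
from statistics import mean, pstdev


def temporal_iou(pred, gt):
    """Intersection over union of two time spans given as (start, end) in seconds."""
    pred_start, pred_end = pred
    gt_start, gt_end = gt
    inter = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - inter
    return inter / union if union > 0 else 0.0


def interleaved_reward(pred_span, gt_span, pred_option, gt_option, w_iou=0.5):
    """Blend localization quality with answer correctness.

    w_iou is a hypothetical weight; the source does not state how the
    two terms are combined.
    """
    correct = 1.0 if pred_option == gt_option else 0.0
    return w_iou * temporal_iou(pred_span, gt_span) + (1.0 - w_iou) * correct


def group_relative_advantages(rewards, eps=1e-6):
    """Center and scale rewards within one group of samples for the same question."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of three sampled outputs for one question
# (ground-truth span 10-20 s, ground-truth option "B").
gt_span, gt_option = (10.0, 20.0), "B"
samples = [((10.0, 20.0), "B"), ((15.0, 25.0), "B"), ((40.0, 50.0), "A")]
rewards = [interleaved_reward(s, gt_span, o, gt_option) for s, o in samples]
advantages = group_relative_advantages(rewards)
```

Because both terms enter one scalar reward, a sample with the right option but a poorly placed span scores below one that gets both right, which is one simple way credit for the answer can depend on the span.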

