CVLGJan 9, 2025

LongViTU: Instruction Tuning for Long-Form Video Understanding

Peking U
arXiv:2501.05037v213 citationsh-index: 10Has Code
AI Analysis

It addresses the challenge of long-term context and condensed reasoning in videos, which is incremental as it builds on existing video understanding methods with a new dataset.

The paper tackles the problem of long-form video understanding by introducing LongViTU, a large-scale dataset with hierarchical QA generation and self-revision, resulting in performance gains of 2.5% and 3.7% for fine-tuned models on benchmarks.

This paper introduces LongViTU, a large-scale (~121k QA pairs, ~900h videos), automatically generated dataset for long-form video understanding. We propose a systematic approach that organizes videos into a hierarchical tree structure for QA generation and incorporates self-revision mechanisms to ensure high-quality QA pairs. Each QA pair in LongViTU features: 1) long-term context (average certificate length of 4.6 minutes); 2) rich knowledge and condensed reasoning (commonsense, causality, planning, etc.)). We also offer explicit timestamp annotations of relevant events for each QA pair. We have conducted extensive human studies on LongViTU, and the results prove the quality of our dataset. To better evaluate the challenges posed by LongViTU's emphasis on long-term context and condensed reasoning, we manually curate a subset of LongViTU into a benchmark. Evaluations using a state-of-the-art open-source model (LongVU), a proprietary model (Gemini-1.5-Pro), and human annotators yield GPT-4 scores of 49.9, 52.3, and 81.0, respectively, underscoring the substantial difficulty presented by LongViTU questions. Performing supervised fine-tuning (SFT) of LongVU and LLaVA-Video on LongViTU data results in average performance gains of 2.5% and 3.7%, respectively, across a suite of long video understanding benchmarks (EgoSchema, VideoMME-Long, MLVU, LVBench).

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes