CV, LG | Jan 9, 2025

LongViTU: Instruction Tuning for Long-Form Video Understanding

Rujie Wu, Xiaojian Ma, Hai Ci, Yue Fan, Yuxuan Wang, Haozhe Zhao, Qing Li, Yizhou Wang

Peking U

arXiv:2501.05037v219.013 citations | h-index: 80 | Has Code

Originality: Synthesis-oriented

AI Analysis

The paper addresses the challenge of long-term context and condensed reasoning in videos. The contribution is incremental: it builds on existing video understanding methods and adds a new dataset.

The paper tackles long-form video understanding by introducing LongViTU, a large-scale dataset built with hierarchical QA generation and self-revision. Fine-tuning LongVU and LLaVA-Video on it yields average gains of 2.5% and 3.7%, respectively, across four long-video benchmarks.

This paper introduces LongViTU, a large-scale (~121k QA pairs, ~900h videos), automatically generated dataset for long-form video understanding. We propose a systematic approach that organizes videos into a hierarchical tree structure for QA generation and incorporates self-revision mechanisms to ensure high-quality QA pairs. Each QA pair in LongViTU features: 1) long-term context (average certificate length of 4.6 minutes); 2) rich knowledge and condensed reasoning (commonsense, causality, planning, etc.). We also offer explicit timestamp annotations of relevant events for each QA pair. We have conducted extensive human studies on LongViTU, and the results prove the quality of our dataset. To better evaluate the challenges posed by LongViTU's emphasis on long-term context and condensed reasoning, we manually curate a subset of LongViTU into a benchmark. Evaluations using a state-of-the-art open-source model (LongVU), a proprietary model (Gemini-1.5-Pro), and human annotators yield GPT-4 scores of 49.9, 52.3, and 81.0, respectively, underscoring the substantial difficulty presented by LongViTU questions. Performing supervised fine-tuning (SFT) of LongVU and LLaVA-Video on LongViTU data results in average performance gains of 2.5% and 3.7%, respectively, across a suite of long video understanding benchmarks (EgoSchema, VideoMME-Long, MLVU, LVBench).
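As a rough illustration of the ideas in the abstract, the sketch below models a video as a tree of timestamped segments and computes a certificate length for a QA pair from its annotated event spans. Everything here is an assumption made for illustration: the `Segment` class, the `leaf_spans` and `certificate_length` helpers, and the reading of certificate length as the total duration of the merged event spans are hypothetical and are not taken from the LongViTU code or data format.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Segment:
    """Hypothetical tree node: a labeled time span (seconds) with sub-segments."""
    label: str
    start: float
    end: float
    children: list[Segment] = field(default_factory=list)


def leaf_spans(node: Segment) -> list[tuple[float, float]]:
    """Collect (start, end) spans of all leaf segments, in depth-first order."""
    if not node.children:
        return [(node.start, node.end)]
    spans = []
    for child in node.children:
        spans.extend(leaf_spans(child))
    return spans


def certificate_length(spans: list[tuple[float, float]]) -> float:
    """Total duration covered by the spans, counting overlapping time once.

    Assumption: a QA pair's certificate length is taken to be the length of
    the union of its annotated event spans.
    """
    total = 0.0
    current_start, current_end = None, None
    for start, end in sorted(spans):
        if current_end is None or start > current_end:
            # Disjoint from the running span: close it and start a new one.
            if current_end is not None:
                total += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlapping or touching: extend the running span.
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start
    return total


# Hypothetical example: a 10-minute video with two events.
video = Segment("video", 0, 600, [
    Segment("event A", 0, 120, [Segment("clip 1", 0, 60), Segment("clip 2", 60, 120)]),
    Segment("event B", 300, 360),
])
```

Under these assumptions, a question whose evidence spans both events would have a certificate length equal to `certificate_length(leaf_spans(video))`, expressed in seconds.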

