CV | AI | Jan 3, 2025

HLV-1K: A Large-scale Hour-Long Video Benchmark for Time-Specific Long Video Understanding

arXiv:2501.01645v3 | 6 citations | h-index: 4 | ICME
AI Analysis

This work addresses the problem of evaluating long video understanding models, but its contribution is incremental: it focuses on dataset creation rather than novel methods.

The authors tackled the lack of large-scale benchmarks for hour-long video understanding by introducing HLV-1K, a dataset with 1009 videos and 14,847 QA pairs, and demonstrated its value through evaluations with existing state-of-the-art methods.

Multimodal large language models have become a popular topic in deep visual understanding due to many promising real-world applications. However, hour-long video understanding, which involves videos spanning over one hour and containing tens of thousands of visual frames, remains under-explored because of 1) challenging long-term video analyses, 2) inefficient large-model approaches, and 3) a lack of large-scale benchmark datasets. In this paper, we focus on the third issue and build a large-scale hour-long video benchmark, HLV-1K, designed to evaluate long video understanding models. HLV-1K comprises 1009 hour-long videos with 14,847 high-quality question answering (QA) and multi-choice question answering (MCQA) pairs with time-aware queries and diverse annotations, covering frame-level, within-event-level, cross-event-level, and long-term reasoning tasks. We evaluate our benchmark using existing state-of-the-art methods and demonstrate its value for testing deep long video understanding capabilities at different levels and for various tasks. This includes promoting future long video understanding tasks at a granular level, such as deep understanding of long live videos, meeting recordings, and movies.
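
To make the four task levels concrete, here is a minimal Python sketch of how MCQA accuracy might be broken down per level. The record layout (`MCQAItem` and its field names), the level strings, and the toy data are illustrative assumptions, not the schema or results of the released HLV-1K annotations.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass


@dataclass
class MCQAItem:
    """Hypothetical MCQA record; field names are assumptions for illustration."""
    video_id: str
    level: str          # e.g. "frame", "within-event", "cross-event", "long-term"
    start_s: float      # start of the queried time span, in seconds
    end_s: float        # end of the queried time span, in seconds
    question: str
    options: list[str]
    answer: int         # index of the correct option


def accuracy_by_level(items: list[MCQAItem], predictions: list[int]) -> dict[str, float]:
    """Return the fraction of correctly answered items for each task level."""
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for item, pred in zip(items, predictions):
        total[item.level] += 1
        correct[item.level] += int(pred == item.answer)
    return {level: correct[level] / total[level] for level in total}


# Toy data, invented purely to exercise the function.
items = [
    MCQAItem("vid_001", "frame", 754.0, 754.0, "What is on screen?", ["A", "B", "C", "D"], 2),
    MCQAItem("vid_001", "within-event", 600.0, 900.0, "What happens next?", ["A", "B", "C", "D"], 0),
    MCQAItem("vid_002", "frame", 30.0, 30.0, "Who is speaking?", ["A", "B", "C", "D"], 1),
]
predictions = [2, 0, 3]  # option indices chosen by some model
scores = accuracy_by_level(items, predictions)
print(scores)
```

Reporting one score per level, rather than a single pooled accuracy, matches the abstract's stated aim of testing capabilities "at different levels".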

Code Implementations: 1 repo
