CV, AI · Jun 12, 2024

LVBench: An Extreme Long Video Understanding Benchmark

arXiv:2406.08035v3 · 348 citations
AI Analysis

LVBench addresses the problem of evaluating long video comprehension for applications such as embodied intelligence and movie reviews, though the contribution is incremental because it builds on existing multimodal model frameworks.

The authors address the lack of benchmarks for long video understanding by introducing LVBench, a dataset of tasks over videos spanning several hours, and find that current multimodal models underperform on these demanding tasks.

Recent progress in multimodal large language models has markedly enhanced the understanding of short videos (typically under one minute), and several evaluation datasets have emerged accordingly. However, these advancements fall short of meeting the demands of real-world applications such as embodied intelligence for long-term decision-making, in-depth movie reviews and discussions, and live sports commentary, all of which require comprehension of long videos spanning several hours. To address this gap, we introduce LVBench, a benchmark specifically designed for long video understanding. Our dataset comprises publicly sourced videos and encompasses a diverse set of tasks aimed at long video comprehension and information extraction. LVBench is designed to challenge multimodal models to demonstrate long-term memory and extended comprehension capabilities. Our extensive evaluations reveal that current multimodal models still underperform on these demanding long video understanding tasks. Through LVBench, we aim to spur the development of more advanced models capable of tackling the complexities of long video comprehension. Our data and code are publicly available at: https://lvbench.github.io.
