CVAIJul 7, 2025

HumanVideo-MME: Benchmarking MLLMs for Human-Centric Video Understanding

arXiv:2507.04909v2h-index: 34
Originality Synthesis-oriented
AI Analysis

This work addresses a gap in benchmarking for researchers and developers working on MLLMs applied to human-centered video analysis, though it is incremental as it builds upon existing benchmark frameworks.

The authors tackled the lack of comprehensive benchmarks for evaluating Multimodal Large Language Models (MLLMs) in human-centric video understanding by proposing HV-MMBench, which includes 13 tasks, 50 visual scenarios, and videos up to 30 minutes long, enabling a holistic assessment of model capabilities.

Multimodal Large Language Models (MLLMs) have demonstrated significant advances in visual understanding tasks involving both images and videos. However, their capacity to comprehend human-centric video data remains underexplored, primarily due to the absence of comprehensive and high-quality evaluation benchmarks. Existing human-centric benchmarks predominantly emphasize video generation quality and action recognition, while overlooking essential perceptual and cognitive abilities required in human-centered scenarios. Furthermore, they are often limited by single-question paradigms and overly simplistic evaluation metrics. To address above limitations, we propose a modern HV-MMBench, a rigorously curated benchmark designed to provide a more holistic evaluation of MLLMs in human-centric video understanding. Compared to existing human-centric video benchmarks, our work offers the following key features: (1) Diverse evaluation dimensions: HV-MMBench encompasses 13 tasks, ranging from basic attribute perception (e.g., age estimation, emotion recognition) to advanced cognitive reasoning (e.g., social relationship prediction, intention prediction), enabling comprehensive assessment of model capabilities; (2) Varied data types: The benchmark includes multiple-choice, fill-in-blank, true/false, and open-ended question formats, combined with diverse evaluation metrics, to more accurately and robustly reflect model performance; (3) Multi-domain video coverage: The benchmark spans 50 distinct visual scenarios, enabling comprehensive evaluation across fine-grained scene variations; (4) Temporal coverage: The benchmark covers videos from short-term (10 seconds) to long-term (up to 30min) durations, supporting systematic analysis of models temporal reasoning abilities across diverse contextual lengths.

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes