CV Aug 19, 2025

HumanPCR: Probing MLLM Capabilities in Diverse Human-Centric Scenes

Keliang Li, Hongze Shen, Hao Shi, Ruibing Hou, Hong Chang, Jie Huang, Chenghao Jia, Wen Wang, Yiling Wu, Dongmei Jiang, Shiguang Shan, Xilin Chen

arXiv:2508.13692v16.21 citations, h-index: 93

Originality: Incremental advance

AI Analysis

HumanPCR addresses the need for better evaluation of MLLMs in human-centric scenarios, though the advance is incremental because it builds on existing benchmarking approaches.

The authors address the problem of evaluating multimodal large language models (MLLMs) in human-centric visual contexts by proposing HumanPCR, an evaluation suite with over 6,000 questions across three hierarchical levels. Their evaluation reveals significant challenges in tasks such as space perception and temporal understanding, with advanced techniques yielding only limited benefits.

The aspiration for artificial general intelligence, fueled by the rapid progress of multimodal models, demands human-comparable performance across diverse environments. We propose HumanPCR, an evaluation suite for probing MLLMs' capabilities in human-related visual contexts across three hierarchical levels: Perception, Comprehension, and Reasoning (denoted by Human-P, Human-C, and Human-R, respectively). Human-P and Human-C feature over 6,000 human-verified multiple-choice questions, assessing a wide range of tasks across 9 dimensions, including but not limited to essential skills frequently overlooked by existing benchmarks. Human-R offers a challenging, manually curated video reasoning test that requires integrating multiple pieces of visual evidence, proactively extracting context beyond question cues, and applying human-like expertise. Each question includes human-annotated Chain-of-Thought (CoT) rationales with key visual evidence to support further research. Extensive evaluations of over 30 state-of-the-art models reveal significant challenges in human-centric visual understanding, particularly in tasks involving detailed space perception, temporal understanding, and mind modeling. Moreover, analysis of Human-R reveals both the models' struggle to extract essential proactive visual evidence from diverse human scenes and their faulty reliance on query-guided retrieval. Even advanced techniques such as scaling visual contexts and test-time thinking yield only limited benefits. We hope HumanPCR and our findings will advance the development, evaluation, and human-centric application of multimodal models.
