OmniHuman: A Large-scale Dataset and Benchmark for Human-Centric Video Generation
For researchers in human-centric video generation, OmniHuman provides a much-needed dataset and evaluation framework for moving beyond current limitations in scene diversity, interaction modeling, and attribute alignment.
OmniHuman addresses the lack of high-quality, diverse data for human-centric video generation by introducing a large-scale dataset with hierarchical annotations and a benchmark with perception-aligned metrics, enabling more realistic synthesis in complex scenes.
Recent advances in joint audio-video generation models have demonstrated impressive capabilities in content creation. However, generating high-fidelity human-centric videos in complex, real-world physical scenes remains a significant challenge. We trace the root cause to structural deficiencies in existing datasets along three dimensions: limited global scene and camera diversity, sparse interaction modeling (both person-person and person-object), and insufficient individual attribute alignment. To bridge these gaps, we present OmniHuman, a large-scale, multi-scene dataset designed for fine-grained human modeling. OmniHuman provides hierarchical annotations covering video-level scenes, frame-level interactions, and individual-level attributes. To build the dataset, we develop a fully automated pipeline for high-quality data collection and multi-modal annotation. Complementing the dataset, we establish the OmniHuman Benchmark (OHBench), a three-level evaluation system for systematically diagnosing human-centric audio-video synthesis across global scenes, relational interactions, and individual attributes. Crucially, OHBench introduces metrics that are highly consistent with human perception, filling a gap left by existing benchmarks.
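
To make the three annotation levels concrete, the following is a minimal illustrative sketch of how one hierarchically annotated clip might be represented. It is not the released OmniHuman schema: every class name, field name, and example value here (VideoAnnotation, FrameInteraction, PersonAttributes, scene, camera, and so on) is a hypothetical assumption chosen only to mirror the video-level, frame-level, and individual-level structure described above.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical label set mirroring the two interaction types named in the abstract.
VALID_INTERACTION_KINDS = {"person-person", "person-object"}


@dataclass
class PersonAttributes:
    """Individual-level annotation (hypothetical fields)."""
    person_id: str
    attributes: Dict[str, str]  # e.g. {"clothing": "red jacket"}


@dataclass
class FrameInteraction:
    """Frame-level annotation of one interaction (hypothetical fields)."""
    frame_index: int
    kind: str  # "person-person" or "person-object"
    participants: List[str]
    description: str

    def __post_init__(self) -> None:
        if self.kind not in VALID_INTERACTION_KINDS:
            raise ValueError(f"unknown interaction kind: {self.kind!r}")


@dataclass
class VideoAnnotation:
    """Video-level annotation that nests the two finer levels (hypothetical fields)."""
    video_id: str
    scene: str
    camera: str
    interactions: List[FrameInteraction] = field(default_factory=list)
    individuals: List[PersonAttributes] = field(default_factory=list)

    def interactions_of_kind(self, kind: str) -> List[FrameInteraction]:
        """Return the frame-level interactions with the given kind."""
        return [i for i in self.interactions if i.kind == kind]


# Made-up example record, for illustration only.
example = VideoAnnotation(
    video_id="clip_0001",
    scene="indoor kitchen",
    camera="static, medium shot",
    interactions=[
        FrameInteraction(12, "person-object", ["p1", "cup"], "p1 picks up a cup"),
        FrameInteraction(40, "person-person", ["p1", "p2"], "p1 hands the cup to p2"),
    ],
    individuals=[
        PersonAttributes("p1", {"clothing": "red jacket"}),
        PersonAttributes("p2", {"clothing": "blue shirt"}),
    ],
)
```

Nesting the finer levels inside the video-level record is one possible design; a flat layout keyed by video and frame identifiers would serve equally well, and the abstract does not state which the dataset uses.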