CV | Aug 5, 2025

Multi-human Interactive Talking Dataset

arXiv:2508.03050v1 | 2 citations | h-index: 10 | Has Code

Originality: Incremental advance

AI Analysis

This work addresses the limited realism of talking video generation, a problem relevant to AI and computer vision researchers. The advance is incremental, since it builds on existing single-person methods.

The authors tackled the lack of datasets for multi-human talking video generation by introducing MIT, a large-scale dataset with 12 hours of high-resolution footage and fine-grained annotations, and proposed CovOG as a baseline model to demonstrate feasibility.

Existing studies on talking video generation have predominantly focused on single-person monologues or isolated facial animations, limiting their applicability to realistic multi-human interactions. To bridge this gap, we introduce MIT, a large-scale dataset specifically designed for multi-human talking video generation. To this end, we develop an automatic pipeline that collects and annotates multi-person conversational videos. The resulting dataset comprises 12 hours of high-resolution footage, each featuring two to four speakers, with fine-grained annotations of body poses and speech interactions. It captures natural conversational dynamics in multi-speaker scenarios, offering a rich resource for studying interactive visual behaviors. To demonstrate the potential of MIT, we further propose CovOG, a baseline model for this novel task. It integrates a Multi-Human Pose Encoder (MPE) to handle varying numbers of speakers by aggregating individual pose embeddings, and an Interactive Audio Driver (IAD) to modulate head dynamics based on speaker-specific audio features. Together, these components showcase the feasibility and challenges of generating realistic multi-human talking videos, establishing MIT as a valuable benchmark for future research. The code is available at: https://github.com/showlab/Multi-human-Talking-Video-Dataset.
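
The abstract says the MPE handles a varying number of speakers by aggregating individual pose embeddings, but it does not say which aggregation is used. As an illustration only, the sketch below assumes simple mean pooling, which yields a fixed-size vector whether two or four speakers are present. The function name `aggregate_pose_embeddings` and the toy embeddings are hypothetical and are not taken from the CovOG code.

```python
from typing import List, Sequence


def aggregate_pose_embeddings(per_speaker: Sequence[Sequence[float]]) -> List[float]:
    """Mean-pool a variable number of per-speaker pose embeddings.

    Assumption: mean pooling stands in for whatever aggregation the
    paper's MPE actually uses; the abstract does not specify it.
    """
    if not per_speaker:
        raise ValueError("need at least one speaker embedding")
    dim = len(per_speaker[0])
    if any(len(e) != dim for e in per_speaker):
        raise ValueError("all embeddings must share the same dimension")
    n = len(per_speaker)
    # Average each coordinate across speakers; output size is independent of n.
    return [sum(e[i] for e in per_speaker) / n for i in range(dim)]


# Toy 2-dimensional embeddings for a two-speaker and a four-speaker clip.
two = aggregate_pose_embeddings([[1.0, 0.0], [3.0, 2.0]])
four = aggregate_pose_embeddings([[1.0, 0.0], [3.0, 2.0], [0.0, 0.0], [0.0, 2.0]])
```

Both calls return a vector of the same length, so a downstream generator could consume it without knowing the speaker count. Mean pooling discards speaker order and identity, which a real model may need to preserve.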
