CV · LG · Mar 16

Face-to-Face: A Video Dataset for Multi-Person Interaction Modeling

arXiv:2603.14794 · 53.51 citations · h-index: 5

Predicted impact: top 65% in CV · last 90 days · Originality: Incremental advance

AI Analysis

This paper provides a dataset and baseline for studying dyadic, sequential behavior in video, addressing a gap in multi-person interaction modeling for researchers in AI and human-computer interaction.

The paper tackles the problem of modeling reactive human conversation by introducing the Face-to-Face with Jimmy Fallon dataset, a 70-hour collection of two-person talk-show exchanges, and demonstrates its utility with a speech-driven digital avatar task that yields small but consistent gains in Emotion-FID and FVD metrics.

Modeling the reactive tempo of human conversation remains difficult because most audio-visual datasets portray isolated speakers delivering short monologues. We introduce Face-to-Face with Jimmy Fallon (F2F-JF), a 70-hour, 14k-clip dataset of two-person talk-show exchanges that preserves the sequential dependency between a guest turn and the host's response. A semi-automatic pipeline combines multi-person tracking, speech diarization, and lightweight human verification to extract temporally aligned host/guest tracks with tight crops and metadata that are ready for downstream modeling. We showcase the dataset with a reactive, speech-driven digital avatar task in which the host video during $[t_1,t_2]$ is generated from their audio plus the guest's preceding video during $[t_0,t_1]$. Conditioning a MultiTalk-style diffusion model on this cross-person visual context yields small but consistent Emotion-FID and FVD gains while preserving lip-sync quality relative to an audio-only baseline. The dataset, preprocessing recipe, and baseline together provide an end-to-end blueprint for studying dyadic, sequential behavior, which we expand upon throughout the paper. Dataset and code will be made publicly available.
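
The windowing behind the reactive avatar task can be illustrated with a minimal sketch. It assumes diarization output is available as a list of speaker turns with start and end times; the `Turn` and `ReactiveSample` types, the `"host"`/`"guest"` labels, and the `max_gap` threshold are hypothetical names chosen for illustration and are not taken from the paper or its code. The sketch pairs each guest turn with the host turn that immediately follows it, giving a guest context window $[t_0,t_1]$ and a host target window $[t_1,t_2]$, and here $t_1$ is assumed to be the start of the host turn.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Turn:
    speaker: str  # "host" or "guest" (hypothetical labels)
    start: float  # seconds
    end: float    # seconds


@dataclass(frozen=True)
class ReactiveSample:
    t0: float  # start of the guest's preceding turn (visual context begins)
    t1: float  # start of the host turn (context ends, target begins)
    t2: float  # end of the host turn (target ends)


def build_reactive_samples(turns: List[Turn], max_gap: float = 0.5) -> List[ReactiveSample]:
    """Pair each guest turn with the host turn that directly follows it.

    Pairs are kept only when the pause between the two turns lies in
    [0, max_gap] seconds; overlapping speech and long pauses are skipped.
    The threshold is an illustrative assumption, not a value from the paper.
    """
    ordered = sorted(turns, key=lambda t: t.start)
    samples = []
    for prev, nxt in zip(ordered, ordered[1:]):
        guest_then_host = prev.speaker == "guest" and nxt.speaker == "host"
        gap = nxt.start - prev.end
        if guest_then_host and 0.0 <= gap <= max_gap:
            samples.append(ReactiveSample(t0=prev.start, t1=nxt.start, t2=nxt.end))
    return samples


if __name__ == "__main__":
    demo = [
        Turn("guest", 0.0, 4.0),
        Turn("host", 4.2, 7.0),
        Turn("guest", 7.1, 9.0),
    ]
    for s in build_reactive_samples(demo):
        print(s)
```

In this sketch, a model for the task would receive the guest video from `t0` to `t1` together with the host audio from `t1` to `t2`, and would be asked to generate the host video from `t1` to `t2`. An audio-only baseline would simply ignore the guest window.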
