CV · LG · Mar 16

Face-to-Face: A Video Dataset for Multi-Person Interaction Modeling

arXiv:2603.14794 · 53.51 citations · h-index: 5
Predicted impact: top 65% in CV · last 90 days
Originality: Incremental advance
AI Analysis

The paper provides a dataset and baseline for studying dyadic, sequential behavior in video, addressing a gap in multi-person interaction modeling for researchers in AI and human-computer interaction.

The paper tackles the problem of modeling reactive human conversation by introducing the Face-to-Face with Jimmy Fallon dataset, a 70-hour collection of two-person talk-show exchanges, and demonstrates its utility with a speech-driven digital avatar task that yields small but consistent gains in Emotion-FID and FVD metrics.

Modeling the reactive tempo of human conversation remains difficult because most audio-visual datasets portray isolated speakers delivering short monologues. We introduce Face-to-Face with Jimmy Fallon (F2F-JF), a 70-hour, 14k-clip dataset of two-person talk-show exchanges that preserves the sequential dependency between a guest turn and the host's response. A semi-automatic pipeline combines multi-person tracking, speech diarization, and lightweight human verification to extract temporally aligned host/guest tracks with tight crops and metadata that are ready for downstream modeling. We showcase the dataset with a reactive, speech-driven digital avatar task in which the host video during $[t_1,t_2]$ is generated from their audio plus the guest's preceding video during $[t_0,t_1]$. Conditioning a MultiTalk-style diffusion model on this cross-person visual context yields small but consistent Emotion-FID and FVD gains while preserving lip-sync quality relative to an audio-only baseline. The dataset, preprocessing recipe, and baseline together provide an end-to-end blueprint for studying dyadic, sequential behavior, which we expand upon throughout the paper. Dataset and code will be made publicly available.
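The windowing behind the reactive avatar task can be illustrated with a minimal Python sketch. It is not the authors' preprocessing code: the `Turn` and `ReactiveSample` types, their field names, and the `build_samples` helper are all hypothetical, and the sketch assumes that diarization yields ordered speaker turns and that $t_1$ is taken as the start of the host's response.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Turn:
    """One diarized speaker turn (hypothetical schema, not from the paper)."""
    speaker: str   # "guest" or "host"
    start: float   # seconds
    end: float     # seconds


@dataclass(frozen=True)
class ReactiveSample:
    """Conditioning and target windows for one guest-then-host exchange."""
    guest_video: Tuple[float, float]  # [t0, t1]: guest's preceding video (context)
    host_audio: Tuple[float, float]   # [t1, t2]: host audio (driving signal)
    host_video: Tuple[float, float]   # [t1, t2]: host video (generation target)


def build_samples(turns: List[Turn]) -> List[ReactiveSample]:
    """Pair each guest turn with the host turn that immediately follows it."""
    ordered = sorted(turns, key=lambda t: t.start)
    samples = []
    for prev, nxt in zip(ordered, ordered[1:]):
        if prev.speaker == "guest" and nxt.speaker == "host":
            # Assumption: t1 is the host turn's start, so the two windows
            # are contiguous as in the [t0, t1] / [t1, t2] formulation.
            t0, t1, t2 = prev.start, nxt.start, nxt.end
            samples.append(ReactiveSample((t0, t1), (t1, t2), (t1, t2)))
    return samples
```

For example, a guest turn from 0 s to 4 s followed by a host turn from 4 s to 9 s would produce one sample whose context window is the guest video over (0.0, 4.0) and whose target is the host video over (4.0, 9.0). Host-to-host or guest-to-guest transitions are skipped in this sketch.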

