CV AI LG MA RO · Sep 6, 2025

InterAct: A Large-Scale Dataset of Dynamic, Expressive and Interactive Activities between Two People in Daily Scenarios

Leo Ho, Yinghao Huang, Dafei Qin, Mingyi Shi, Wangpok Tse, Wei Liu, Junichi Yamagishi, Taku Komura

arXiv:2509.05747v18.42 citations · h-index: 4 · Proc ACM Comput Graph Interact Tech

Originality Synthesis-oriented

AI Analysis

This work addresses the need for datasets and methods that model dynamic, long-term interactions between two people. The contribution is incremental: it builds on prior work that focused on single-person motion or limited two-person interactions.

The authors tackle the problem of capturing interactive behaviors between two people in daily scenarios. They introduce the InterAct dataset, which comprises 241 motion sequences with audio, body motions, and facial expressions, and demonstrate a diffusion-based method that estimates interactive motions from speech inputs.

We address the problem of accurately capturing interactive behaviors between two people in daily scenarios. Most previous works either consider only one person or focus solely on the conversational gestures of two people, assuming that the body orientation and/or position of each actor is constant or barely changes over each interaction. In contrast, we propose to simultaneously model two people's activities, and target objective-driven, dynamic, and semantically consistent interactions, which often span longer durations and cover larger spaces. To this end, we capture a new multi-modal dataset dubbed InterAct, which is composed of 241 motion sequences in which two people perform a realistic and coherent scenario for one minute or longer over a complete interaction. For each sequence, the two actors are assigned different roles and emotion labels, and collaborate to finish one task or conduct a common interaction activity. The audio, body motions, and facial expressions of both persons are captured. InterAct contains diverse and complex individual motions as well as interesting and relatively long-term interaction patterns rarely seen before. We also demonstrate a simple yet effective diffusion-based method that estimates the interactive facial expressions and body motions of two people from speech inputs. Our method regresses the body motions in a hierarchical manner, and we also propose a novel fine-tuning mechanism to improve the lip accuracy of facial expressions. To facilitate further research, the data and code are made available at https://hku-cg.github.io/interact/ .
