Modeling Multimodal Social Interactions: New Challenges and Baselines with Densely Aligned Representations
This work addresses the challenge of interpreting social interactions in AI systems. Its contribution is incremental: it extends existing multimodal approaches with new tasks and new data annotations.
The paper models fine-grained social interactions in multi-party environments by introducing three new tasks (speaking target identification, pronoun coreference resolution, and mentioned player prediction) and a baseline method built on densely aligned multimodal representations. The reported experiments indicate that this dense alignment is effective for these tasks.
Understanding social interactions involving both verbal and non-verbal cues is essential for effectively interpreting social situations. However, most prior works on multimodal social cues focus predominantly on single-person behaviors or rely on holistic visual representations that are not aligned to utterances in multi-party environments. Consequently, they are limited in modeling the intricate dynamics of multi-party interactions. In this paper, we introduce three new challenging tasks to model the fine-grained dynamics between multiple people: speaking target identification, pronoun coreference resolution, and mentioned player prediction. We contribute extensive data annotations to curate these new challenges in social deduction game settings. Furthermore, we propose a novel multimodal baseline that leverages densely aligned language-visual representations by synchronizing visual features with their corresponding utterances. This facilitates concurrently capturing verbal and non-verbal cues pertinent to social reasoning. Experiments demonstrate the effectiveness of the proposed approach with densely aligned multimodal representations in modeling fine-grained social interactions. Project website: https://sangmin-git.github.io/projects/MMSI.
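As a rough illustration of what "synchronizing visual features with their corresponding utterances" could look like, here is a minimal Python sketch. It is not the paper's implementation. The names `Utterance` and `align_visual_to_utterances`, the per-player feature dictionaries, the timestamp format, and the use of mean pooling are all assumptions made for illustration; the abstract does not specify how the alignment is computed.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical types: a per-frame record maps each visible player to a
# feature vector (for example, pose or gaze features from some extractor).
FrameFeatures = Dict[str, List[float]]


@dataclass
class Utterance:
    speaker: str
    text: str
    start: float  # utterance start time in seconds
    end: float    # utterance end time in seconds


def align_visual_to_utterances(
    utterances: List[Utterance],
    frames: List[Tuple[float, FrameFeatures]],
) -> List[Tuple[Utterance, FrameFeatures]]:
    """Pair each utterance with per-player visual features pooled over its span.

    Assumption: frames whose timestamp falls inside [start, end] belong to the
    utterance, and features are mean-pooled per player. Both choices are
    illustrative and not taken from the paper.
    """
    aligned = []
    for utt in utterances:
        per_player: Dict[str, List[List[float]]] = {}
        for timestamp, feats in frames:
            if utt.start <= timestamp <= utt.end:
                for player, vec in feats.items():
                    per_player.setdefault(player, []).append(vec)
        # Average each feature dimension across the frames collected per player.
        pooled = {
            player: [sum(col) / len(vecs) for col in zip(*vecs)]
            for player, vecs in per_player.items()
        }
        aligned.append((utt, pooled))
    return aligned
```

Under these assumptions, every utterance carries its own visual context for each player, so a downstream model could read verbal and non-verbal cues from the same time window instead of from one holistic clip-level feature.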