M3PT: A Transformer for Multimodal, Multi-Party Social Signal Prediction with Person-aware Blockwise Attention
This work addresses the challenge of building unified models for social signal prediction in multi-party settings, which matters for applications such as human-robot interaction. The contribution is incremental, however, because it adapts existing transformer methods to a specific domain.
The paper tackles the problem of predicting multimodal social signals in multi-party conversations using a single model, and demonstrates that incorporating multiple modalities improves bite timing and speaking status prediction on the Human-Human Commensality Dataset.
Understanding social signals in multi-party conversations is important for human-robot interaction and artificial social intelligence. Social signals include body pose, head pose, speech, and context-specific activities like acquiring and taking bites of food when dining. Past work in multi-party interaction tends to build task-specific models for predicting social signals. In this work, we address the challenge of predicting multimodal social signals in multi-party settings in a single model. We introduce M3PT, a causal transformer architecture with modality and temporal blockwise attention masking to simultaneously process multiple social cues across multiple participants and their temporal interactions. We train and evaluate M3PT on the Human-Human Commensality Dataset (HHCD), and demonstrate that using multiple modalities improves bite timing and speaking status prediction. Source code: https://github.com/AbrarAnwar/masked-social-signals/.
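The abstract does not spell out the masking scheme, so the following is only a minimal sketch of one way temporal blockwise causal masking could look. It assumes tokens are laid out time step by time step, with one token per (person, modality) pair inside each step; the function name `blockwise_causal_mask` and this layout are hypothetical and not taken from the M3PT source code. Under that assumption, a token may attend to every token in its own time block and in earlier blocks, but not to later ones. The sketch does not model any additional person-specific or modality-specific masking the actual architecture may use.

```python
def blockwise_causal_mask(num_steps, num_people, num_modalities):
    """Build a boolean attention mask for a block-causal layout.

    Assumed (hypothetical) token order: all tokens of time step 0,
    then all tokens of time step 1, and so on. Each time step holds
    num_people * num_modalities tokens.

    mask[i][j] is True when query token i may attend to key token j.
    """
    block = num_people * num_modalities  # tokens per time step
    n = num_steps * block                # total sequence length
    # Token i belongs to time block i // block. Attention is allowed
    # to the same block or any earlier block, never to a later one.
    return [[(j // block) <= (i // block) for j in range(n)]
            for i in range(n)]


if __name__ == "__main__":
    # 2 time steps, 2 people, 2 modalities -> 8 tokens in blocks of 4.
    mask = blockwise_causal_mask(2, 2, 2)
    for row in mask:
        print("".join("x" if allowed else "." for allowed in row))
```

In a transformer, a mask like this would typically be applied by setting the disallowed attention logits to negative infinity before the softmax, so that each time step sees all participants and modalities up to and including the present but nothing from the future.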