CV
Aug 29, 2024

Exploiting temporal information to detect conversational groups in videos and predict the next speaker

arXiv:2408.16380v1
1 citation
h-index: 3
Originality: Incremental advance
AI Analysis

This work addresses the analysis of social interactions in videos, with applications such as surveillance and human-computer interaction. It is an incremental advance: it builds on existing concepts such as F-formations and uses standard methods such as LSTMs.

The paper tackles two problems in videos, detecting conversational groups (F-formations) and predicting the next speaker, by exploiting temporal information and multimodal signals. It reports 85% true positives in group detection and 98% accuracy in next-speaker prediction.

Studies in human-human interaction have introduced the concept of F-formation to describe the spatial arrangement of participants during social interactions. This paper has two objectives: detecting F-formations in video sequences and predicting the next speaker in a group conversation. The proposed approach exploits temporal information and human multimodal signals in video sequences. In particular, we rely on measuring the engagement level of people as a feature of group belonging. Our approach makes use of a recurrent neural network, the Long Short-Term Memory (LSTM), to predict who will take the next speaking turn in a conversation group. Experiments on the MatchNMingle dataset led to 85% true positives in group detection and 98% accuracy in predicting the next speaker.
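To make the next-speaker step concrete, here is a minimal sketch of how an LSTM can map a sequence of per-frame features to a probability for each participant. It is not the authors' architecture: the `TinyLSTM` class, the layer sizes, the single engagement-like feature per participant, and the random untrained weights are all assumptions made for illustration, so the output shows the shape of the computation only, not a meaningful prediction.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, v):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class TinyLSTM:
    """Single-layer LSTM with untrained random weights (illustration only)."""

    GATES = ("input", "forget", "output", "cand")

    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = random.Random(seed)
        self.n_hidden = n_hidden
        # One weight matrix per gate, applied to the concatenation [x_t, h_{t-1}].
        self.W = {g: rand_matrix(n_hidden, n_in + n_hidden, rng) for g in self.GATES}
        self.b = {g: [0.0] * n_hidden for g in self.GATES}
        # Linear read-out from the final hidden state to one logit per participant.
        self.W_out = rand_matrix(n_out, n_hidden, rng)

    def step(self, x, h, c):
        z = list(x) + h  # concatenate input and previous hidden state
        pre = {
            g: [a + b for a, b in zip(matvec(self.W[g], z), self.b[g])]
            for g in self.GATES
        }
        i = [sigmoid(v) for v in pre["input"]]
        f = [sigmoid(v) for v in pre["forget"]]
        o = [sigmoid(v) for v in pre["output"]]
        cand = [math.tanh(v) for v in pre["cand"]]
        # Standard LSTM updates: new cell state, then new hidden state.
        c_new = [fv * cv + iv * gv for fv, cv, iv, gv in zip(f, c, i, cand)]
        h_new = [ov * math.tanh(cv) for ov, cv in zip(o, c_new)]
        return h_new, c_new

    def predict_next_speaker(self, sequence):
        h = [0.0] * self.n_hidden
        c = [0.0] * self.n_hidden
        for x in sequence:
            h, c = self.step(x, h, c)
        logits = matvec(self.W_out, h)
        # Numerically stable softmax over participants.
        m = max(logits)
        exps = [math.exp(v - m) for v in logits]
        total = sum(exps)
        return [e / total for e in exps]


# Hypothetical input: three participants, one engagement-like score each per frame.
frames = [[0.2, 0.9, 0.1], [0.3, 0.8, 0.2], [0.7, 0.4, 0.1]]
model = TinyLSTM(n_in=3, n_hidden=4, n_out=3)
probs = model.predict_next_speaker(frames)
predicted = max(range(len(probs)), key=probs.__getitem__)
```

In a real system the weights would be trained on labelled turn-taking data, and the per-frame input would carry whatever multimodal features are available rather than a single score per person.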

