CV · Apr 29

Automated Detection of Mutual Gaze and Joint Attention in Dual-Camera Settings via Dual-Stream Transformers

Jakub Kosmydel, Paweł Gajewski, Arkadiusz Białek

arXiv:2604.27105 · 14.7

Predicted impact: top 93% in CV · last 90 days
Originality: Incremental advance

AI Analysis

The work gives behavioral scientists a scalable, open-source tool that automates the labor-intensive coding of social interactions in multi-camera laboratory settings.

The paper introduces a dual-stream Transformer architecture for automated detection of mutual gaze and joint attention from dual-camera recordings, outperforming convolutional baselines and a multimodal LLM on caregiver-infant interaction data.

Analyzing mutual gaze (MG) and joint attention (JA) is critical in developmental psychology but traditionally relies on labor-intensive manual coding. Automating this process in multi-camera laboratory settings is computationally challenging due to complex cross-camera relational dynamics. In this paper, we propose a highly efficient dual-stream Transformer architecture for detecting MG and JA from synchronized dual-camera recordings. Our approach leverages frozen gaze-aware backbones (GazeLLE) to extract rich visual priors, combined with a custom token fusion mechanism to map the spatial and semantic relationships between interacting dyads. Evaluated on an ecologically valid dataset of caregiver-infant interactions, our model exhibits good performance, significantly outperforming both a convolutional baseline and a state-of-the-art multimodal Large Language Model (LLM). By open-sourcing our model and pre-trained weights, we provide behavioral scientists with a scalable tool that can be fine-tuned to diverse laboratory environments, effectively bridging the gap between computational modeling and applied interaction research.
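To make the dual-stream idea in the abstract concrete, the following is a minimal, dependency-free sketch of one plausible reading of it: each camera contributes a set of token vectors, the two sets attend to each other, and a small head scores MG and JA. Everything here is an assumption for illustration. The abstract does not specify the fusion mechanism, so plain cross-attention with a residual connection is swapped in; random vectors stand in for frozen GazeLLE features; the names `cross_attend`, `mean_pool`, `predict`, and `head_weights` are hypothetical; and the weights are untrained, so the scores carry no meaning.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys_values):
    """Fuse one stream with the other via scaled dot-product cross-attention.

    Each query token is replaced by itself plus a softmax-weighted mix of the
    other stream's tokens (residual connection). No learned projections are
    used; this is a stand-in, not the paper's fusion mechanism.
    """
    dim = len(queries[0])
    fused = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            for k in keys_values
        ]
        weights = softmax(scores)
        mixed = [
            sum(w * kv[d] for w, kv in zip(weights, keys_values))
            for d in range(dim)
        ]
        fused.append([qi + mi for qi, mi in zip(q, mixed)])
    return fused


def mean_pool(tokens):
    """Average a list of token vectors into a single vector."""
    count = len(tokens)
    return [sum(t[d] for t in tokens) / count for d in range(len(tokens[0]))]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict(tokens_cam_a, tokens_cam_b, head_weights):
    """Return one independent probability-like score per label."""
    fused_a = cross_attend(tokens_cam_a, tokens_cam_b)
    fused_b = cross_attend(tokens_cam_b, tokens_cam_a)
    pooled = mean_pool(fused_a) + mean_pool(fused_b)  # list concatenation
    return {
        label: sigmoid(sum(w * x for w, x in zip(weights, pooled)))
        for label, weights in head_weights.items()
    }


# Hypothetical inputs: 3 tokens per camera, 4 dimensions each.
random.seed(0)
DIM = 4
cam_a = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(3)]
cam_b = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(3)]
heads = {
    "mutual_gaze": [random.uniform(-0.5, 0.5) for _ in range(2 * DIM)],
    "joint_attention": [random.uniform(-0.5, 0.5) for _ in range(2 * DIM)],
}
scores = predict(cam_a, cam_b, heads)
print(scores)
```

In a real pipeline the token lists would come from the frozen backbone applied to each synchronized frame, and only the fusion and head parameters would be trained. Two independent sigmoid outputs are used in this sketch because MG and JA are treated here as separate binary labels, which is itself an assumption.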

