CV · Jul 7, 2022

AV-Gaze: A Study on the Effectiveness of Audio Guided Visual Attention Estimation for Non-Profilic Faces

Shreya Ghosh, Abhinav Dhall, Munawar Hayat, Jarrod Knibbe

arXiv:2207.03048v23.73 citations · h-index: 43 · Has Code

Originality: Incremental advance

AI Analysis

This work addresses gaze estimation in real-world scenarios where visual data is limited, though it appears incremental because it combines existing audio and visual methods.

The paper tackles the problem of visual attention estimation for non-profilic faces under challenging conditions like extreme head-pose and occlusions by using audio signals to complement visual information, achieving competitive results on benchmark datasets.

In challenging real-life conditions such as extreme head-pose, occlusions, and low-resolution images, where visual information alone fails to estimate visual attention/gaze direction, audio signals can provide important complementary information. In this paper, we explore whether audio-guided coarse head-pose can further enhance visual attention estimation performance for non-profilic faces. Since it is difficult to annotate audio signals for estimating the head-pose of the speaker, we use off-the-shelf state-of-the-art models to facilitate cross-modal weak supervision. During the training phase, the framework learns complementary information from synchronized audio and visual modalities. Our model can use any of the available modalities, i.e., audio, visual, or audio-visual, for task-specific inference. When AV-Gaze is tested on benchmark datasets with these specific modalities, it achieves competitive results on multiple datasets while being highly adaptive to challenging scenarios.
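The two ideas in the abstract, weak supervision from an off-the-shelf model and inference with whichever modality is available, can be illustrated with a minimal sketch. This is not the authors' architecture: the function names `weak_supervision_loss` and `fuse_available`, the mean-absolute-error loss, and the averaging fusion are all assumptions chosen only to make the ideas concrete.

```python
from typing import List, Optional, Sequence


def weak_supervision_loss(audio_pred: Sequence[float],
                          teacher_pose: Sequence[float]) -> float:
    """Mean absolute error between an audio-branch head-pose prediction and a
    pseudo-label from an off-the-shelf visual model (hypothetical loss choice)."""
    if len(audio_pred) != len(teacher_pose):
        raise ValueError("prediction and pseudo-label must have the same length")
    return sum(abs(p - t) for p, t in zip(audio_pred, teacher_pose)) / len(audio_pred)


def fuse_available(audio: Optional[Sequence[float]],
                   visual: Optional[Sequence[float]]) -> List[float]:
    """Average whichever modality embeddings are present (assumed fusion rule).

    Passing None for a modality simulates that modality being unavailable.
    """
    present = [m for m in (audio, visual) if m is not None]
    if not present:
        raise ValueError("at least one modality is required")
    dim = len(present[0])
    if any(len(m) != dim for m in present):
        raise ValueError("embeddings must share one dimension")
    return [sum(m[i] for m in present) / len(present) for i in range(dim)]


# Audio-visual, visual-only, and audio-only inputs all go through the same call.
both = fuse_available([1.0, 3.0], [3.0, 5.0])
visual_only = fuse_available(None, [1.0, 2.0])
audio_only = fuse_available([0.5, 0.5], None)
```

In this sketch the pseudo-label plays the role of the annotation that is hard to obtain for audio, and the fusion step degrades gracefully to a single modality rather than failing when one input is missing.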

