CV · Aug 10, 2023

Interaction-aware Joint Attention Estimation Using People Attributes

arXiv:2308.05382v1 · 9 citations · h-index: 25
Originality: Incremental advance
AI Analysis

This work addresses the problem of accurately estimating joint attention in images for computer vision applications, representing an incremental improvement over existing methods.

The paper tackles joint attention estimation in single images by incorporating people's locations and actions as contextual cues and modeling interactions among attributes, achieving state-of-the-art performance in quantitative comparisons.

This paper proposes a method for estimating joint attention in a single image. Unlike related work, in which only the gaze-related attributes of people are used, each independently, our method (i) also employs people's locations and actions as contextual cues for weighting their attributes, and (ii) explicitly models the interactions among all of these attributes. For the interaction modeling, we propose a novel Transformer-based attention network that encodes joint attention as low-dimensional features. We introduce a specialized MLP head with positional embedding to the Transformer so that it predicts the pixelwise confidence of joint attention, from which the confidence heatmap is generated. This pixelwise prediction improves heatmap accuracy by avoiding the ill-posed problem of predicting a high-dimensional heatmap from low-dimensional features. The estimated joint attention is further improved by integrating it with general image-based attention estimation. Our method quantitatively outperforms state-of-the-art (SOTA) methods in comparative experiments. Code: https://anonymous.4open.science/r/anonymized_codes-ECA4.
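The idea behind the pixelwise MLP head can be illustrated with a small NumPy sketch. Instead of regressing all H×W heatmap values directly from one low-dimensional feature vector, a single shared MLP is queried once per pixel with that feature concatenated to the pixel's positional embedding, and it returns one confidence value. Everything here is an assumption for illustration, not the authors' code: the sinusoidal embedding, the layer sizes, the random weights standing in for a trained head, the hypothetical helper names (`sinusoidal_pos_embedding`, `pixelwise_heatmap`, `fuse_heatmaps`), and the weighted-average fusion with an image-based heatmap (the abstract does not say how the two are integrated).

```python
import numpy as np

def sinusoidal_pos_embedding(h, w, dim):
    """(h*w, dim) sinusoidal embedding of normalized (y, x) pixel coordinates.
    Assumed scheme: half the channels encode y, half encode x; dim % 4 == 0."""
    assert dim % 4 == 0
    ys, xs = np.meshgrid(np.linspace(0, 1, h), np.linspace(0, 1, w), indexing="ij")
    coords = np.stack([ys.ravel(), xs.ravel()], axis=1)          # (h*w, 2)
    freqs = (2.0 ** np.arange(dim // 4)) * np.pi                 # (dim/4,)
    angles = coords[:, :, None] * freqs[None, None, :]           # (h*w, 2, dim/4)
    emb = np.concatenate([np.sin(angles), np.cos(angles)], axis=2)
    return emb.reshape(h * w, dim)

def init_mlp(in_dim, hidden, rng):
    """Random two-layer MLP weights (stand-in for a trained head)."""
    return {
        "W1": rng.normal(0.0, in_dim ** -0.5, (in_dim, hidden)),
        "b1": np.zeros(hidden),
        "W2": rng.normal(0.0, hidden ** -0.5, (hidden, 1)),
        "b2": np.zeros(1),
    }

def pixelwise_heatmap(feature, params, h, w, pos_dim):
    """Query one shared MLP per pixel with [feature, pos_emb(pixel)] -> confidence."""
    pos = sinusoidal_pos_embedding(h, w, pos_dim)                # (h*w, pos_dim)
    feat = np.broadcast_to(feature, (h * w, feature.shape[0]))   # same feature at every pixel
    x = np.concatenate([feat, pos], axis=1)                      # (h*w, D + pos_dim)
    hidden = np.maximum(0.0, x @ params["W1"] + params["b1"])    # ReLU
    logits = hidden @ params["W2"] + params["b2"]                # (h*w, 1)
    return (1.0 / (1.0 + np.exp(-logits))).reshape(h, w)         # sigmoid confidence map

def fuse_heatmaps(joint, image_based, alpha=0.5):
    """Hypothetical fusion: convex combination of the two heatmaps."""
    return alpha * joint + (1.0 - alpha) * image_based

rng = np.random.default_rng(0)
D, POS_DIM, H, W = 32, 16, 24, 40
params = init_mlp(D + POS_DIM, 64, rng)
feature = rng.normal(size=D)  # stand-in for the Transformer's low-dimensional joint-attention feature
heatmap = pixelwise_heatmap(feature, params, H, W, POS_DIM)
print(heatmap.shape)  # (24, 40)
```

The appeal of this design is that the head's parameter count, here (D + POS_DIM)·64 + 64 + 64 + 1, does not depend on the heatmap resolution, whereas a direct linear map from D features to H·W outputs would need D·H·W weights. The same head can also be queried at a different resolution without retraining its shape.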

Code Implementations: 1 repo
