CV · May 26

OmniGF: A Dual-Branch Vision-Language Framework for Unified Gaze Following

Qiaomu Miao, Haoyu Wu, Jingyi Xu, Minh Hoai, Dimitris Samaras

arXiv:2605.26399 · 81.1 · Has Code

Predicted impact: top 23% in CV · last 90 days
Originality: Highly original

AI Analysis

For researchers in gaze following and human-computer interaction, OmniGF unifies spatial and semantic gaze tasks in a scalable multi-person framework, overcoming the precision limitations of text-only VLM outputs.

OmniGF proposes a dual-branch vision-language framework that adapts VLMs for multi-person gaze reasoning, achieving state-of-the-art performance across multiple benchmarks by combining discrete semantic reasoning with continuous spatial heatmap decoding.

Understanding human gaze behavior is essential for complex scene comprehension and human-computer interaction. Traditional gaze following models are typically restricted to pure spatial localization, lacking the high-level capacity to reason about semantic targets or complex social contexts. Furthermore, these models often process individuals sequentially, requiring redundant computations over the same scene image for multi-person inference. While recent Vision-Language Models (VLMs) offer the exceptional semantic reasoning needed to address gaze-related semantic tasks, their reliance on discrete text generation inherently limits precision in continuous spatial tasks like gaze localization. To bridge this gap, we propose OmniGF, a unified vision-language framework that adapts foundational VLMs for highly scalable multi-person gaze reasoning. The model adopts a dual-branch decoding strategy: a structured language branch generates discrete reasoning states, while a continuous spatial branch directly taps into the VLM's dense hidden states. Supervising these extracted representations with high-resolution gaze target heatmaps effectively overcomes the spatial bottleneck of text-only coordinate generation. Furthermore, to explicitly ground the model in multi-person scenes, we augment the input with head embeddings encoded from cropped head images, providing fine-grained appearance and orientation cues for all individuals simultaneously. By modeling all individuals and leveraging the strong semantic capability of VLMs, OmniGF seamlessly integrates precise spatial gaze target estimation, semantic gaze prediction, and complex social gaze reasoning. Extensive experiments demonstrate that our framework establishes new state-of-the-art performance across multiple standard benchmarks. Code is available at https://github.com/cvlab-stonybrook/omnigf.
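
To make the spatial branch described in the abstract more concrete, here is a minimal sketch of decoding one heatmap per person from dense hidden states in a single pass. It is not the authors' implementation: the function names, the scaled dot-product scoring between per-person and per-patch hidden states, the soft-argmax readout, and all shapes are assumptions chosen for illustration, and random arrays stand in for real VLM activations.

```python
import numpy as np


def decode_gaze_heatmaps(patch_hidden, person_hidden, grid_hw):
    """Hypothetical spatial head: score every image patch for every person.

    patch_hidden:  (h*w, D) hidden states of the image patch tokens.
    person_hidden: (P, D) one hidden state per person (assumed to carry
                   the head-embedding information).
    Returns (P, h, w) heatmaps, each normalized to sum to 1.
    """
    h, w = grid_hw
    assert patch_hidden.shape[0] == h * w, "patch count must match the grid"
    dim = patch_hidden.shape[1]
    # Scaled dot-product similarity between each person and each patch.
    logits = person_hidden @ patch_hidden.T / np.sqrt(dim)  # (P, h*w)
    # Numerically stable softmax over patches, one distribution per person.
    logits = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=1, keepdims=True)
    return probs.reshape(-1, h, w)


def heatmap_to_point(heatmap):
    """Soft-argmax: expected (x, y) location in normalized [0, 1] coordinates."""
    h, w = heatmap.shape
    ys = (np.arange(h) + 0.5) / h  # patch-center row coordinates
    xs = (np.arange(w) + 0.5) / w  # patch-center column coordinates
    x = float((heatmap.sum(axis=0) * xs).sum())
    y = float((heatmap.sum(axis=1) * ys).sum())
    return x, y


# Random stand-ins for VLM hidden states (illustrative values only).
rng = np.random.default_rng(0)
h, w, dim, num_people = 16, 16, 32, 3
patch_hidden = rng.normal(size=(h * w, dim))
person_hidden = rng.normal(size=(num_people, dim))

# All people are decoded together from one set of scene features.
heatmaps = decode_gaze_heatmaps(patch_hidden, person_hidden, (h, w))
points = [heatmap_to_point(hm) for hm in heatmaps]
```

Because the readout is an expectation over a continuous heatmap, the predicted point is not restricted to a fixed vocabulary of coordinate tokens, which is the limitation of text-only coordinate generation that the abstract points to. Scoring every person against the same patch features also reflects the multi-person claim: the scene is encoded once rather than once per individual.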
