Enhancing Gaze Reasoning in Vision Foundation Models for Gaze Following
For gaze-following researchers, this work addresses a known bottleneck in VFM-based methods: it improves gaze reasoning without sacrificing scene understanding.
Vision foundation models (VFMs) improve scene understanding for gaze following but contribute little to gaze reasoning, so VFM-based methods often fall back on semantically salient objects. The authors propose a head-conditioned local LoRA and an out-of-cone penalty to enhance gaze reasoning, achieving state-of-the-art results on GazeFollow and VAT, especially for non-salient targets.
Gaze following requires both scene understanding and gaze reasoning to localize the gaze target of an in-scene person. Recently, vision foundation models (VFMs) have demonstrated strong performance on this task, enabling simpler architectures while outperforming prior methods. However, we observe a key limitation of VFM-based approaches: while VFMs substantially improve scene understanding, they contribute little to gaze reasoning. As a result, existing methods often rely on semantically salient objects rather than true gaze cues, leading to degraded performance when targets are not salient. To address this, we propose a novel training mechanism to enhance gaze reasoning in VFMs for gaze following. Our method includes: (1) a head-conditioned local LoRA, which enables localized adaptation to preserve scene token learning while improving head token learning for gaze reasoning; and (2) an out-of-cone penalty, which injects gaze cues into head tokens while aligning them with scene tokens. Experiments on the GazeFollow and VAT datasets demonstrate that our method achieves state-of-the-art performance, with particularly strong improvements when gaze targets are not semantically salient. Our findings offer valuable insights for advancing future gaze following research. We will release the code once the paper is accepted.
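The abstract names the two components but gives no formulas, so the following is only a minimal sketch of how such mechanisms might look, not the authors' implementation. Everything in it is an assumption made for illustration: the function names `local_lora_forward` and `out_of_cone_penalty`, the boolean `head_mask` that marks head tokens, the 30-degree cone half-angle, and the choice to measure the penalty as the fraction of heatmap mass outside the cone. It uses plain Python lists to stay self-contained.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def local_lora_forward(tokens, head_mask, W, A, B, scale=1.0):
    """Frozen projection W for every token, plus a low-rank update B(Ax)
    applied only to tokens flagged in head_mask (assumed formulation).

    W: d_out x d_in frozen weights; A: r x d_in; B: d_out x r.
    Scene tokens (mask False) pass through W unchanged by the adapter.
    """
    outputs = []
    for x, is_head in zip(tokens, head_mask):
        y = matvec(W, x)
        if is_head:
            delta = matvec(B, matvec(A, x))  # low-rank path
            y = [yi + scale * di for yi, di in zip(y, delta)]
        outputs.append(y)
    return outputs


def out_of_cone_penalty(heatmap, head_xy, gaze_dir, half_angle_deg=30.0):
    """Fraction of heatmap mass outside a cone around gaze_dir that
    starts at head_xy (hypothetical definition of the penalty).

    heatmap: list of rows, indexed as heatmap[y][x].
    """
    hx, hy = head_xy
    dx, dy = gaze_dir
    norm = math.hypot(dx, dy)
    if norm == 0:
        raise ValueError("gaze_dir must be non-zero")
    dx, dy = dx / norm, dy / norm
    cos_limit = math.cos(math.radians(half_angle_deg))

    total = 0.0
    outside = 0.0
    for y, row in enumerate(heatmap):
        for x, value in enumerate(row):
            total += value
            vx, vy = x - hx, y - hy
            dist = math.hypot(vx, vy)
            if dist == 0:
                continue  # the head cell itself counts as inside the cone
            cos_angle = (vx * dx + vy * dy) / dist
            if cos_angle < cos_limit:
                outside += value
    return outside / total if total > 0 else 0.0


# Toy usage: the first token is a head token, the second a scene token.
W = [[1, 0], [0, 1]]
A = [[1, 1]]        # rank r = 1
B = [[1], [0]]
adapted = local_lora_forward([[1, 2], [3, 4]], [True, False], W, A, B)

# Head at x=0, y=1 looking right; half the heatmap mass lies off-axis.
penalty = out_of_cone_penalty([[1, 0, 0], [0, 0, 1], [0, 0, 0]], (0, 1), (1, 0))
```

In this sketch the scene token leaves the layer exactly as the frozen weights produce it, which is one way to read "localized adaptation to preserve scene token learning". The penalty grows as predicted heatmap mass falls outside the assumed gaze cone; how the paper actually ties this signal to head tokens and scene tokens is not stated in the abstract.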