CV | Dec 12, 2024

Gaze-LLE: Gaze Target Estimation via Large-Scale Learned Encoders

Fiona Ryan, Ajay Bati, Sangmin Lee, Daniel Bolya, Judy Hoffman, James M. Rehg

Georgia Tech

arXiv:2412.09586v2 | 20.236 citations | h-index: 19 | Has Code | CVPR

Originality: Incremental advance

AI Analysis

The paper addresses gaze prediction for applications such as human-computer interaction. Its advance is incremental: it builds on prior work by simplifying the pipeline.

The paper tackles gaze target estimation, that is, predicting where a person is looking in a scene. It reports state-of-the-art performance across several benchmarks with a novel transformer framework that uses a frozen DINOv2 encoder.

We address the problem of gaze target estimation, which aims to predict where a person is looking in a scene. Predicting a person's gaze target requires reasoning both about the person's appearance and the contents of the scene. Prior works have developed increasingly complex, hand-crafted pipelines for gaze target estimation that carefully fuse features from separate scene encoders, head encoders, and auxiliary models for signals like depth and pose. Motivated by the success of general-purpose feature extractors on a variety of visual tasks, we propose Gaze-LLE, a novel transformer framework that streamlines gaze target estimation by leveraging features from a frozen DINOv2 encoder. We extract a single feature representation for the scene, and apply a person-specific positional prompt to decode gaze with a lightweight module. We demonstrate state-of-the-art performance across several gaze benchmarks and provide extensive analysis to validate our design choices. Our code is available at: http://github.com/fkryan/gazelle.
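
The abstract describes the person-specific positional prompt only at a high level. One possible reading is sketched below: the scene is encoded once into a grid of patch tokens, the patches covered by a person's head box are marked, and a single prompt vector is added to those tokens before decoding. The function names, the normalized `(x0, y0, x1, y1)` box format, and the additive form of the prompt are assumptions made for illustration, not the authors' implementation. Plain nested lists stand in for features that would really come from the frozen encoder.

```python
from typing import List, Tuple

Vector = List[float]
TokenGrid = List[List[Vector]]  # grid_h x grid_w x feature_dim


def head_patch_mask(
    bbox: Tuple[float, float, float, float], grid_h: int, grid_w: int
) -> List[List[int]]:
    """Mark patches whose cell overlaps a normalized (x0, y0, x1, y1) head box.

    Hypothetical helper: the box format and overlap rule are assumptions.
    """
    x0, y0, x1, y1 = bbox
    mask = []
    for row in range(grid_h):
        top, bottom = row / grid_h, (row + 1) / grid_h
        mask_row = []
        for col in range(grid_w):
            left, right = col / grid_w, (col + 1) / grid_w
            overlaps = left < x1 and right > x0 and top < y1 and bottom > y0
            mask_row.append(1 if overlaps else 0)
        mask.append(mask_row)
    return mask


def apply_head_prompt(
    tokens: TokenGrid, mask: List[List[int]], prompt: Vector
) -> TokenGrid:
    """Add the prompt vector to every token flagged by the mask.

    Tokens outside the mask are copied unchanged, so the same scene
    features can be reused for each person in the image.
    """
    prompted = []
    for token_row, mask_row in zip(tokens, mask):
        prompted.append(
            [
                [t + m * p for t, p in zip(token, prompt)]
                for token, m in zip(token_row, mask_row)
            ]
        )
    return prompted
```

Because only the mask changes from person to person, the scene features in this sketch are computed once and re-prompted per person, which matches the abstract's description of a single scene representation. In a real model the prompt vector would be a learned parameter and the prompted tokens would feed the lightweight decoding module.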

