HC · Apr 29

UIGaze: How Closely Can VLMs Approximate Human Visual Attention on User Interfaces?

arXiv:2604.26352
AI Analysis

This work provides the first systematic evaluation of VLMs for predicting human attention on UIs, offering insights for UI design and saliency prediction.

UIGaze evaluates how well nine state-of-the-art VLMs predict human visual attention on user interfaces using real eye-tracking data from 1,980 UI screenshots. Results show moderate alignment with human gaze, varying by UI type and improving with longer viewing durations.

Vision Language Models (VLMs) have demonstrated strong capabilities in understanding visual content, yet their ability to predict where humans look on user interfaces remains unexplored. We present UIGaze, a study investigating how closely VLMs can approximate human visual attention on user interfaces using real eye-tracking data. Using the UEyes dataset, which comprises 1,980 UI screenshots across four categories (webpage, desktop, mobile, poster) with eye-tracking data from 62 participants, we evaluate nine state-of-the-art VLMs through a zero-shot coordinate prediction pipeline. Each model generates gaze point coordinates that are converted into saliency maps via Gaussian blurring and compared against ground truth using three metrics: Pearson's correlation coefficient (CC), similarity (SIM), and Kullback-Leibler (KL) divergence. Our experiments (1,980 images × 9 models × 3 runs × 3 durations) reveal that VLMs achieve moderate alignment with human gaze patterns, with the degree of alignment varying significantly across UI types and improving with longer viewing durations, which suggests that VLMs capture exploratory gaze patterns rather than initial fixations. All code, predictions, and evaluation results are publicly available.
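The evaluation step described in the abstract (gaze points, then a saliency map, then CC, SIM, and KL divergence) can be illustrated with a minimal sketch. This is not the authors' code. The function names, the grid size, and the value of `sigma` are assumptions chosen for illustration, and the map is built by summing one Gaussian per gaze point, which matches blurring a point map with an untruncated Gaussian kernel up to a constant factor. The metric definitions follow the forms commonly used in saliency benchmarks, with KL computed from the ground-truth map to the predicted map.

```python
import math

def gaze_to_saliency(points, width, height, sigma):
    """Sum one isotropic Gaussian per (x, y) gaze point (sigma is an assumed value)."""
    sal = [[0.0] * width for _ in range(height)]
    for px, py in points:
        for y in range(height):
            for x in range(width):
                d2 = (x - px) ** 2 + (y - py) ** 2
                sal[y][x] += math.exp(-d2 / (2.0 * sigma ** 2))
    return sal

def _flatten(m):
    return [v for row in m for v in row]

def _as_distribution(values, eps=1e-12):
    """Scale non-negative values so they sum to (approximately) 1."""
    total = sum(values)
    return [v / (total + eps) for v in values]

def cc(pred, gt):
    """Pearson's correlation coefficient between two maps."""
    p, g = _flatten(pred), _flatten(gt)
    mp, mg = sum(p) / len(p), sum(g) / len(g)
    cov = sum((a - mp) * (b - mg) for a, b in zip(p, g))
    sp = math.sqrt(sum((a - mp) ** 2 for a in p))
    sg = math.sqrt(sum((b - mg) ** 2 for b in g))
    return cov / (sp * sg) if sp > 0 and sg > 0 else 0.0

def sim(pred, gt):
    """Histogram intersection of the two maps treated as distributions."""
    p = _as_distribution(_flatten(pred))
    g = _as_distribution(_flatten(gt))
    return sum(min(a, b) for a, b in zip(p, g))

def kl_div(pred, gt, eps=1e-12):
    """KL divergence from ground truth to prediction; lower is better."""
    p = _as_distribution(_flatten(pred))
    g = _as_distribution(_flatten(gt))
    return sum(b * math.log(eps + b / (a + eps)) for a, b in zip(p, g))

# Hypothetical example: two human fixations and one model's predicted points
# on a small 32 x 24 grid.
human = gaze_to_saliency([(8, 6), (20, 12)], 32, 24, sigma=3.0)
model = gaze_to_saliency([(9, 6), (19, 13)], 32, 24, sigma=3.0)
print(cc(model, human), sim(model, human), kl_div(model, human))
```

Higher CC and SIM and lower KL indicate closer agreement with the human map. Under the design stated in the abstract, 1,980 × 9 × 3 × 3 comes to 160,380 image, model, run, and duration combinations, each of which would be scored this way.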
