CV · Mar 26

Infinite Gaze Generation for Videos with Autoregressive Diffusion

arXiv:2603.24938 · 20.3 · h-index: 11
Predicted impact: top 30% in CV · last 90 days
Originality: Incremental advance
AI Analysis

This work addresses the challenge of capturing fine-grained temporal dynamics and long-range dependencies in gaze prediction for videos. Its contribution is incremental: it extends existing methods to videos of arbitrary length.

The paper tackles the problem of predicting human gaze in videos over long durations. It presents a generative framework that uses an autoregressive diffusion model to synthesize raw gaze trajectories with continuous spatial coordinates and high-resolution timestamps. The authors report that it significantly outperforms existing approaches in long-range spatio-temporal accuracy and trajectory realism.

Predicting human gaze in video is fundamental to advancing scene understanding and multimodal interaction. While traditional saliency maps provide spatial probability distributions and scanpaths offer ordered fixations, both abstractions often collapse the fine-grained temporal dynamics of raw gaze. Furthermore, existing models are typically constrained to short-term windows ($\approx$ 3-5s), failing to capture the long-range behavioral dependencies inherent in real-world content. We present a generative framework for infinite-horizon raw gaze prediction in videos of arbitrary length. By leveraging an autoregressive diffusion model, we synthesize gaze trajectories characterized by continuous spatial coordinates and high-resolution timestamps. Our model is conditioned on a saliency-aware visual latent space. Quantitative and qualitative evaluations demonstrate that our approach significantly outperforms existing approaches in long-range spatio-temporal accuracy and trajectory realism.
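To make the idea of chunk-wise autoregressive generation concrete, here is a minimal toy sketch. It is not the paper's model: the learned diffusion denoiser is replaced by a hypothetical hand-written `toy_denoise_step`, the saliency-aware visual latent is reduced to a single assumed `(x, y)` saliency peak per chunk, and `generate_gaze`, `chunk_len`, `num_steps`, and the 60 Hz `sample_rate_hz` are all illustrative assumptions. It only shows the control flow: each chunk starts from noise, is iteratively refined, and then becomes the conditioning context for the next chunk.

```python
import random


def toy_denoise_step(points, context, saliency_peak, rate=0.5):
    """Hypothetical stand-in for a learned denoiser (an assumption, not the paper's network).

    Pulls every noisy point part of the way toward an anchor: the midpoint
    between the last gaze point of the previous chunk and the saliency peak.
    """
    ax = 0.5 * (context[-1][0] + saliency_peak[0])
    ay = 0.5 * (context[-1][1] + saliency_peak[1])
    return [(x + rate * (ax - x), y + rate * (ay - y)) for x, y in points]


def clamp01(v):
    """Keep a normalized coordinate inside the frame, i.e. within [0, 1]."""
    return min(1.0, max(0.0, v))


def generate_gaze(saliency_peaks, chunk_len=4, num_steps=8,
                  sample_rate_hz=60.0, seed=0):
    """Generate (timestamp_s, x, y) gaze samples chunk by chunk.

    saliency_peaks: iterable of (x, y) in [0, 1], one per video chunk
    (a toy substitute for the visual conditioning described in the abstract).
    """
    rng = random.Random(seed)
    context = [(0.5, 0.5)]  # assumed start: gaze at the frame center
    trajectory = []
    for k, peak in enumerate(saliency_peaks):
        # Each chunk starts from Gaussian noise around the frame center.
        chunk = [(rng.gauss(0.5, 0.3), rng.gauss(0.5, 0.3))
                 for _ in range(chunk_len)]
        # Iterative refinement, conditioned on the previous chunk and the peak.
        for _ in range(num_steps):
            chunk = toy_denoise_step(chunk, context, peak)
        chunk = [(clamp01(x), clamp01(y)) for x, y in chunk]
        for i, (x, y) in enumerate(chunk):
            timestamp = (k * chunk_len + i) / sample_rate_hz
            trajectory.append((timestamp, x, y))
        # Autoregression: the finished chunk conditions the next one.
        context = chunk
    return trajectory


if __name__ == "__main__":
    for sample in generate_gaze([(0.2, 0.3), (0.8, 0.7)]):
        print(sample)
```

Because `saliency_peaks` can be any iterable, including a generator, the loop keeps producing samples for as long as chunks arrive, which is the sense in which a rollout of this shape is not tied to a fixed video length.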

