20.3CVMar 26
Infinite Gaze Generation for Videos with Autoregressive DiffusionJenna Kang, Colin Groth, Tong Wu et al.
Predicting human gaze in video is fundamental to advancing scene understanding and multimodal interaction. While traditional saliency maps provide spatial probability distributions and scanpaths offer ordered fixations, both abstractions often collapse the fine-grained temporal dynamics of raw gaze. Furthermore, existing models are typically constrained to short-term windows ($\approx$ 3-5s), failing to capture the long-range behavioral dependencies inherent in real-world content. We present a generative framework for infinite-horizon raw gaze prediction in videos of arbitrary length. By leveraging an autoregressive diffusion model, we synthesize gaze trajectories characterized by continuous spatial coordinates and high-resolution timestamps. Our model is conditioned on a saliency-aware visual latent space. Quantitative and qualitative evaluations demonstrate that our approach significantly outperforms existing approaches in long-range spatio-temporal accuracy and trajectory realism.
CVSep 1, 2025
MILO: A Lightweight Perceptual Quality Metric for Image and Latent-Space OptimizationUğur Çoğalan, Mojtaba Bemana, Karol Myszkowski et al.
We present MILO (Metric for Image- and Latent-space Optimization), a lightweight, multiscale, perceptual metric for full-reference image quality assessment (FR-IQA). MILO is trained using pseudo-MOS (Mean Opinion Score) supervision, in which reproducible distortions are applied to diverse images and scored via an ensemble of recent quality metrics that account for visual masking effects. This approach enables accurate learning without requiring large-scale human-labeled datasets. Despite its compact architecture, MILO outperforms existing metrics across standard FR-IQA benchmarks and offers fast inference suitable for real-time applications. Beyond quality prediction, we demonstrate the utility of MILO as a perceptual loss in both image and latent domains. In particular, we show that spatial masking modeled by MILO, when applied to latent representations from a VAE encoder within Stable Diffusion, enables efficient and perceptually aligned optimization. By combining spatial masking with a curriculum learning strategy, we first process perceptually less relevant regions before progressively shifting the optimization to more visually distorted areas. This strategy leads to significantly improved performance in tasks like denoising, super-resolution, and face restoration, while also reducing computational overhead. MILO thus functions as both a state-of-the-art image quality metric and as a practical tool for perceptual optimization in generative pipelines.