CV | Jun 2

Attend to Anything: Foundation Model for Unified Human Attention Modeling

Wenzhuo Zhao, Ronghao Xian, Keren Fu, Qijun Zhao

arXiv:2606.03540 | 70.0 | h-index: 2 | Has Code

AI Analysis

Provides a general-purpose foundation model for attention/saliency, addressing fragmentation across modalities and tasks.

AAM unifies human attention modeling across image, video, and audio-visual tasks, outperforming state-of-the-art methods by an average of 6% across 16 benchmarks, with an approximately 4× speedup in video inference.

Existing human attention (saliency) modeling methods remain highly fragmented across modalities, scenes, and task formulations. Consequently, even with increasing model capacity and data scale, current models remain predominantly scene-dependent and task-specific, and fail to generalize in real-world applications. To address these fundamental limitations, we present the Attend to Anything Model (AAM), a multi-modal foundation model that unifies attention modeling across various image, video, and audio-visual tasks and scenes. AAM reformulates attention as a cognitive entailment relationship organized in a general-to-specific hierarchy, implemented through language prompts with hierarchical embeddings in hyperbolic space. Furthermore, to unify static image and dynamic video attention, we adopt a fluid-dynamics perspective, formulating video-frame attention as a diffusive temporal evolution governed by the Fokker-Planck equation. Extensive experiments on 16 benchmarks demonstrate that AAM consistently outperforms state-of-the-art methods by an average of 6% across various scenarios, while achieving approximately a 4× speedup in video inference. Overall, these results demonstrate that AAM provides a principled foundation for future research on attention and saliency-related tasks. The dataset and code will be available at https://github.com/wz-zhao/Attend-to-Anything.
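The abstract does not state the exact form of the Fokker-Planck equation that AAM uses, so the following is only a generic illustration and not the authors' method. It steps the textbook equation dp/dt = -div(v p) + D * laplacian(p) on a pixel grid, treating an attention map p as a density that drifts with a velocity field v and spreads with diffusion coefficient D. The function name `fokker_planck_step` and the parameters `drift_x`, `drift_y`, `diffusion`, and `dt` are hypothetical names chosen for this sketch.

```python
import numpy as np


def fokker_planck_step(p, drift_x, drift_y, diffusion=0.1, dt=0.1):
    """One explicit Euler step of dp/dt = -div(v p) + D * laplacian(p).

    Assumptions for this sketch (not taken from the paper): unit grid
    spacing, periodic boundaries, central differences. The scheme is only
    stable for small dt (roughly 4 * diffusion * dt <= 1) and modest drift.
    """
    # Probability flux carried by the drift field.
    flux_x = drift_x * p
    flux_y = drift_y * p
    # Divergence of the flux via central differences (axis 1 is x, axis 0 is y).
    div = (np.roll(flux_x, -1, axis=1) - np.roll(flux_x, 1, axis=1)) / 2.0
    div += (np.roll(flux_y, -1, axis=0) - np.roll(flux_y, 1, axis=0)) / 2.0
    # Five-point discrete Laplacian for the diffusion term.
    lap = (
        np.roll(p, 1, axis=0) + np.roll(p, -1, axis=0)
        + np.roll(p, 1, axis=1) + np.roll(p, -1, axis=1)
        - 4.0 * p
    )
    return p + dt * (-div + diffusion * lap)


# Usage: start from a single attended pixel and evolve it over ten "frames".
attention = np.zeros((32, 32))
attention[16, 16] = 1.0
for _ in range(10):
    attention = fokker_planck_step(attention, drift_x=0.5, drift_y=0.0)
print(attention.sum())  # total attention mass is conserved up to rounding
```

With periodic boundaries both the divergence and Laplacian terms sum to zero over the grid, so the map stays normalized while its peak spreads out and shifts in the drift direction.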
