PRE-MAP: Personalized Reinforced Eye-tracking Multimodal LLM for High-Resolution Multi-Attribute Point Prediction
This work addresses the challenge of capturing subjective cognitive diversity in eye-tracking for applications like advertisement analysis, though it is incremental in combining reinforcement learning with MLLMs for a specific domain.
The paper tackles the problem of predicting personalized visual attention in high-resolution videos by addressing limitations in existing saliency models and multimodal large language models (MLLMs), resulting in a new dataset (SPA-ADV) and a model (PRE-MAP) that improves prediction accuracy through reinforcement learning.
Visual selective attention, driven by individual preferences, regulates human prioritization of visual stimuli by bridging subjective cognitive mechanisms with objective visual elements, thereby steering the semantic interpretation and hierarchical processing of dynamic visual scenes. However, existing models and datasets predominantly neglect the influence of subjective cognitive diversity on fixation behavior. Conventional saliency prediction models, typically employing segmentation approaches, rely on low-resolution imagery to generate saliency heatmaps, subsequently upscaled to native resolutions, which limiting their capacity to capture personalized attention patterns. Furthermore, MLLMs are constrained by factors such as hallucinations, making it very costly to strictly adhere to the expected format in tasks involving multiple point predictions, and achieving precise point positioning is challenging. To address these limitations, we present Subjective Personalized Attention for Advertisement Videos, namely SPA-ADV, a large-scale multimodal dataset capturing gaze behaviors from over 4,500 participants varying in age and gender with 486 videos. Furthermore, we propose PRE-MAP, a novel eye-tracking saliency model that characterizes Personalized visual disparities through Reinforcement learning-optimized Eye-tracking, built upon MLLMs and guided by Multi-Attribute user profiles to predict Points. To ensure MLLMs produce prediction points that are both format-correct and spatially accurate, we introduce Consistency Group Relative Policy Optimization (C-GRPO), inspired by the variability in eye movement points and Multi-Attribute profiles. Extensive experiments on SPA-ADV and other benchmarks demonstrate the effectiveness of our approach. The code and dataset are available at \href{https://github.com/mininglamp-MLLM/PRE-MAP}{this URL}.