Emergence of Fixational and Saccadic Movements in a Multi-Level Recurrent Attention Model for Vision
This work addresses unrealistic visual exploration in hard attention models and is aimed at computer vision researchers. Its contribution is incremental, since it builds on prior models such as RAM and DRAM.
The paper tackles the problem that existing hard attention models produce unrealistic eye movement behavior. It proposes a Multi-Level Recurrent Attention Model (MRAM) that explicitly models the neural hierarchy of visual processing, yielding more human-like attention dynamics and outperforming baselines on image classification benchmarks.
Inspired by foveal vision, hard attention models promise interpretability and parameter economy. However, existing models such as the Recurrent Model of Visual Attention (RAM) and the Deep Recurrent Attention Model (DRAM) fail to model the hierarchy of the human visual system, which compromises their visual exploration dynamics. As a result, they tend to produce attention that is either overly fixational or excessively saccadic, diverging from human eye movement behavior. In this paper, we propose a Multi-Level Recurrent Attention Model (MRAM), a novel hard attention framework that explicitly models the neural hierarchy of human visual processing. By decoupling glimpse location generation and task execution into two recurrent layers, MRAM exhibits an emergent balance between fixational and saccadic movements. Our results show that MRAM not only achieves more human-like attention dynamics, but also consistently outperforms CNN, RAM, and DRAM baselines on standard image classification benchmarks.
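To make the idea of decoupling glimpse location generation from task execution more concrete, the following is a minimal sketch of a two-layer recurrent hard attention loop in NumPy. It is not the authors' implementation: the abstract does not specify the architecture, so the function names (`extract_glimpse`, `init_params`, `run_episode`), the plain tanh recurrent cells, the layer sizes, the choice of which layer feeds which head, and the Gaussian location sampling are all assumptions made for illustration. The weights are random and no training (for example, policy-gradient training of the location head) is shown.

```python
import numpy as np

def extract_glimpse(image, loc, size=8):
    """Crop a size x size patch centered at `loc` (row, col in [-1, 1])."""
    h, w = image.shape
    # Map normalized coordinates to pixel indices.
    cy = int(round((loc[0] + 1) / 2 * (h - 1)))
    cx = int(round((loc[1] + 1) / 2 * (w - 1)))
    half = size // 2
    # Zero-pad so that patches near the border keep a fixed shape.
    padded = np.pad(image, half, mode="constant")
    return padded[cy:cy + size, cx:cx + size]

def init_params(rng, size=8, hidden=32, n_classes=10, scale=0.1):
    """Random weights for the sketch (hypothetical shapes, untrained)."""
    d_in = size * size + 2  # flattened patch plus (row, col) location
    return {
        "w_in_task": rng.normal(0, scale, (d_in, hidden)),
        "w_rec_task": rng.normal(0, scale, (hidden, hidden)),
        "w_in_loc": rng.normal(0, scale, (hidden, hidden)),
        "w_rec_loc": rng.normal(0, scale, (hidden, hidden)),
        "w_loc": rng.normal(0, scale, (hidden, 2)),
        "w_cls": rng.normal(0, scale, (hidden, n_classes)),
    }

def run_episode(image, params, n_glimpses=6, size=8, loc_std=0.1, seed=0):
    """Run one glimpse sequence; return visited locations and class logits."""
    rng = np.random.default_rng(seed)
    hidden = params["w_rec_task"].shape[0]
    h_task = np.zeros(hidden)  # recurrent state used for the task head
    h_loc = np.zeros(hidden)   # recurrent state used for the location head
    loc = np.zeros(2)          # assumption: start at the image center
    locations = []
    for _ in range(n_glimpses):
        locations.append(loc)
        patch = extract_glimpse(image, loc, size)
        x = np.concatenate([patch.ravel(), loc])
        # Layer 1 (assumed task layer): integrates glimpse content.
        h_task = np.tanh(x @ params["w_in_task"] + h_task @ params["w_rec_task"])
        # Layer 2 (assumed location layer): driven by layer 1, keeps its own state.
        h_loc = np.tanh(h_task @ params["w_in_loc"] + h_loc @ params["w_rec_loc"])
        # Hard attention: sample the next location around a predicted mean.
        mean = np.tanh(h_loc @ params["w_loc"])
        loc = np.clip(rng.normal(mean, loc_std), -1.0, 1.0)
    logits = h_task @ params["w_cls"]
    return np.array(locations), logits
```

Under these assumptions, a call such as `locs, logits = run_episode(image, init_params(np.random.default_rng(1)))` returns the sequence of glimpse centers and the class scores. One crude way to inspect the exploration pattern is to compute the step sizes `np.linalg.norm(np.diff(locs, axis=0), axis=1)`: small steps would correspond loosely to fixation-like behavior and large steps to saccade-like jumps. This proxy is an illustrative choice and is not a metric taken from the paper.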