GAMR: A Guided Attention Model for (visual) Reasoning
This work targets complex visual reasoning, an area where AI systems still lag behind humans. It offers a novel approach that lends support to cognitive theories of the interplay between attention and memory, though it appears to be an incremental advance over existing active vision methods.
The paper tackles the problem of visual reasoning by proposing GAMR, a module that uses guided attention to dynamically select and route visual information into memory, achieving robust and sample-efficient learning on various tasks and demonstrating zero-shot generalization to novel reasoning tasks.
Humans continue to outperform modern AI systems in their ability to flexibly parse and understand complex visual scenes. Here, we present a novel module for visual reasoning, the Guided Attention Model for (visual) Reasoning (GAMR), which instantiates an active vision theory positing that the brain solves complex visual reasoning problems dynamically, via sequences of attention shifts that select and route task-relevant visual information into memory. Experiments on an array of visual reasoning tasks and datasets demonstrate GAMR's ability to learn visual routines in a robust and sample-efficient manner. In addition, GAMR is shown to be capable of zero-shot generalization on completely novel reasoning tasks. Overall, our work provides computational support for cognitive theories that postulate a critical interplay between attention and memory, needed to dynamically maintain and manipulate task-relevant visual information when solving complex visual reasoning tasks.
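To make the idea of "sequences of attention shifts that route information into memory" concrete, the following is a minimal sketch of a generic soft-attention loop with a memory bank. It is not GAMR's actual architecture, which the abstract does not specify. The function names (`attend`, `guided_attention_rollout`), the dot-product scoring, and the rule that updates the query from the mean of memory are all assumptions chosen for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(query, features):
    """Soft attention: weight each location's feature vector by its match to the query."""
    weights = softmax([dot(query, f) for f in features])
    dim = len(features[0])
    attended = [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
    return attended, weights


def guided_attention_rollout(features, initial_query, num_steps):
    """Run a fixed number of attention shifts, writing each attended vector to memory.

    Hypothetical update rule (an assumption, not taken from the paper):
    the next query is the mean of everything stored in memory so far.
    """
    memory = []
    attention_maps = []
    query = list(initial_query)
    for _ in range(num_steps):
        attended, weights = attend(query, features)
        memory.append(attended)          # route the selected information into memory
        attention_maps.append(weights)   # keep the attention map for inspection
        dim = len(attended)
        query = [sum(slot[d] for slot in memory) / len(memory) for d in range(dim)]
    return memory, attention_maps


# Toy input: three image locations, each described by a 2-d feature vector.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
memory, maps = guided_attention_rollout(features, initial_query=[1.0, 0.0], num_steps=3)
```

In this toy setup, a downstream reasoning step would read from `memory` rather than from the full feature map, which is the sense in which attention "routes" task-relevant content. In a learned model the query update would be produced by a trained controller rather than the fixed averaging rule used here.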