CV · Nov 13, 2021

Where to Look: A Unified Attention Model for Visual Recognition with Reinforcement Learning

arXiv:2111.07169v11.4

Originality: Incremental advance

AI Analysis

Aimed at computer vision researchers, this work addresses a specific bottleneck in visual attention models and offers incremental improvements in stability and accuracy.

The paper tackles the high variance and instability of recurrent attention models for visual recognition by unifying top-down and bottom-up attention with reinforcement learning. The resulting model improves on CNNs and baseline attention models on visual classification tasks.

Using recurrent neural networks for visual attention has gained popularity in the computer vision community. Although the recurrent attention model (RAM) can use glimpses with larger patch sizes to widen its scope, doing so may result in high variance and instability. For example, exploring objects of interest in a large image requires a Gaussian policy with high variance, which may lead to randomized search and unstable learning. In this paper, we propose to unify top-down and bottom-up attention for recurrent visual attention. Our model exploits image pyramids and Q-learning to select regions of interest in the top-down attention mechanism, which in turn guides the policy search in the bottom-up approach. In addition, we add two further constraints on the bottom-up recurrent neural networks for better exploration. We train our model in an end-to-end reinforcement learning framework and evaluate it on visual classification tasks. In our experiments, the model outperforms a convolutional neural network (CNN) baseline and bottom-up recurrent attention models on these tasks.
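
To make the abstract's central idea more concrete, the following is a minimal NumPy sketch of how a top-down region choice might narrow a bottom-up Gaussian glimpse policy. It is not the paper's implementation: the function names, the 4x4 region grid, the epsilon-greedy rule, and the standard deviations are all illustrative assumptions, and the Q-values here are a placeholder (mean intensity of the coarsest pyramid level) standing in for a learned Q-function.

```python
import numpy as np

def build_pyramid(image, levels=3):
    """Return [full-res, half-res, ...] using 2x2 average pooling."""
    pyramid = [image]
    for _ in range(levels - 1):
        img = pyramid[-1]
        h, w = img.shape[0] // 2 * 2, img.shape[1] // 2 * 2  # crop to even size
        img = img[:h, :w].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
        pyramid.append(img)
    return pyramid

def select_region(q_values, rng, epsilon=0.1):
    """Epsilon-greedy pick over a grid of region Q-values; returns (row, col)."""
    if rng.random() < epsilon:
        idx = int(rng.integers(q_values.size))
    else:
        idx = int(np.argmax(q_values))
    return np.unravel_index(idx, q_values.shape)

def region_center(cell, grid_shape):
    """Center of a grid cell in normalized [-1, 1] coordinates (y, x)."""
    r, c = cell
    rows, cols = grid_shape
    return np.array([(r + 0.5) / rows, (c + 0.5) / cols]) * 2.0 - 1.0

def sample_glimpse(mean, std, rng):
    """Draw one glimpse location from a Gaussian policy, clipped to the image."""
    return np.clip(rng.normal(loc=mean, scale=std), -1.0, 1.0)

rng = np.random.default_rng(0)
image = rng.random((64, 64))                 # hypothetical grayscale input
coarse = build_pyramid(image, levels=3)[-1]  # 16x16 coarsest level

# Placeholder Q-values (assumption): mean intensity of each cell in a 4x4 grid.
q_values = coarse.reshape(4, 4, 4, 4).mean(axis=(1, 3))

cell = select_region(q_values, rng, epsilon=0.0)
mean = region_center(cell, q_values.shape)

# Guided policy: small variance around the selected region.
guided = np.array([sample_glimpse(mean, 0.1, rng) for _ in range(1000)])
# Unguided policy: large variance around the image center to cover the image.
unguided = np.array([sample_glimpse(np.zeros(2), 0.6, rng) for _ in range(1000)])

print("guided std:", guided.std(axis=0))
print("unguided std:", unguided.std(axis=0))
```

Under these assumptions, the guided samples cluster around the selected region, while the unguided policy needs a much wider spread to reach the same area. That wider spread is the high-variance search the abstract identifies as a source of unstable learning.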
