Learning to predict where to look in interactive environments using deep recurrent Q-learning
This work addresses visual attention prediction for humans performing tasks in interactive environments, and reports an improvement over existing bottom-up saliency methods.
The paper tackles the problem of predicting where to look in interactive environments such as video games, where bottom-up saliency models perform poorly. It uses deep recurrent Q-learning with a soft attention mechanism to highlight task-relevant locations, and reports significantly better fixation location predictions than bottom-up models such as Itti-Koch and GBVS.
Bottom-Up (BU) saliency models do not perform well in complex interactive environments where humans are actively engaged in tasks (e.g., sandwich making and playing video games). In this paper, we leverage Reinforcement Learning (RL) to highlight task-relevant locations of input frames. We propose a soft attention mechanism combined with the Deep Q-Network (DQN) model to teach an RL agent how to play a game and where to look by focusing on the most pertinent parts of its visual input. Our evaluations on several Atari 2600 games show that the soft attention based model predicts fixation locations significantly better than bottom-up models such as the Itti-Koch saliency and Graph-Based Visual Saliency (GBVS) models.
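To make the soft attention idea concrete, the following is a minimal sketch of one attention step over a flattened feature map, conditioned on a recurrent state. It is not the paper's architecture, which the abstract does not specify: the additive tanh scoring function, the names `soft_attention`, `w_feat`, and `w_hid`, and the toy values are all assumptions chosen for illustration. The point it shows is that the softmax weights over spatial locations can be read back as a saliency map.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a flat list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_attention(features, hidden, w_feat, w_hid):
    """One soft attention step (illustrative, not the paper's exact model).

    features: list of L feature vectors, one per spatial location, each of length D
    hidden:   recurrent state of length H
    w_feat:   hypothetical scoring weights of length D
    w_hid:    hypothetical scoring weights of length H
    Returns (alpha, context): weights over the L locations and the
    attention-weighted feature vector of length D.
    """
    # Contribution of the recurrent state is shared by all locations.
    hid_term = sum(w * h for w, h in zip(w_hid, hidden))
    # Score each location from its features plus the recurrent state.
    scores = [
        math.tanh(sum(w * f for w, f in zip(w_feat, feat)) + hid_term)
        for feat in features
    ]
    alpha = softmax(scores)
    dim = len(features[0])
    # Context vector: expectation of the features under alpha.
    context = [
        sum(a * feat[d] for a, feat in zip(alpha, features))
        for d in range(dim)
    ]
    return alpha, context


# Toy 2x2 feature map flattened to 4 locations with 2 channels each (made-up values).
features = [[0.1, 0.0], [0.9, 0.8], [0.2, 0.1], [0.0, 0.3]]
hidden = [0.5, -0.2, 0.1]
w_feat = [1.0, 1.0]
w_hid = [0.3, 0.3, 0.3]

alpha, context = soft_attention(features, hidden, w_feat, w_hid)
```

In a full agent, `context` would feed the recurrent Q-network that selects actions, while `alpha`, reshaped to the spatial grid and upsampled to the frame size, would serve as the predicted attention map to compare against human fixations.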