Unsupervised Salient Patch Selection for Data-Efficient Reinforcement Learning
This work addresses data efficiency for reinforcement learning practitioners, but its contribution is incremental, since it builds on existing self-supervised Vision Transformer methods.
The paper tackles sample inefficiency in vision-based deep reinforcement learning by proposing SPIRL, a method that automatically extracts important patches from input images. The resulting gains in data efficiency are validated on Atari games against state-of-the-art methods.
To improve the sample efficiency of vision-based deep reinforcement learning (RL), we propose a novel method, called SPIRL, to automatically extract important patches from input images. Following Masked Auto-Encoders, SPIRL is based on Vision Transformer models pre-trained in a self-supervised fashion to reconstruct images from randomly sampled patches. These pre-trained models can then be exploited to detect and select salient patches, defined as those that are hard to reconstruct from neighboring patches. In RL, the SPIRL agent processes the selected salient patches via an attention module. We empirically validate SPIRL on Atari games to test its data efficiency against relevant state-of-the-art methods, including some traditional model-based methods and keypoint-based models. In addition, we analyze our model's interpretability capabilities.
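
The selection criterion described above (a patch is salient when it is hard to reconstruct from its neighbors) can be illustrated with a minimal sketch. This is not the SPIRL implementation: the pre-trained Vision Transformer is replaced here by a crude stand-in that predicts each patch as the average of its four adjacent patches, and the function names, the patch size, and the top-k selection rule are all assumptions made for illustration.

```python
import numpy as np


def split_into_patches(image, patch):
    """Cut a 2-D grayscale image into a (grid_h, grid_w, patch, patch) array."""
    h, w = image.shape
    gh, gw = h // patch, w // patch
    cropped = image[: gh * patch, : gw * patch]
    return cropped.reshape(gh, patch, gw, patch).transpose(0, 2, 1, 3)


def neighbor_mean_reconstruction(patches):
    """Hypothetical stand-in for a pre-trained reconstruction model.

    Each patch is predicted as the mean of its up/down/left/right
    neighbors; border patches reuse themselves via edge padding.
    """
    padded = np.pad(patches, ((1, 1), (1, 1), (0, 0), (0, 0)), mode="edge")
    up = padded[:-2, 1:-1]
    down = padded[2:, 1:-1]
    left = padded[1:-1, :-2]
    right = padded[1:-1, 2:]
    return (up + down + left + right) / 4.0


def select_salient_patches(image, patch=4, k=3):
    """Return grid coordinates of the k patches with the highest
    reconstruction error, plus the full per-patch error map."""
    patches = split_into_patches(image.astype(np.float64), patch)
    recon = neighbor_mean_reconstruction(patches)
    # Mean squared error per patch: high error = hard to reconstruct.
    errors = ((patches - recon) ** 2).mean(axis=(2, 3))
    top = np.argsort(errors, axis=None)[::-1][:k]
    rows, cols = np.unravel_index(top, errors.shape)
    return list(zip(rows.tolist(), cols.tolist())), errors


if __name__ == "__main__":
    # Toy frame: empty background with one bright object.
    frame = np.zeros((16, 16))
    frame[8:12, 4:8] = 1.0
    coords, _ = select_salient_patches(frame, patch=4, k=1)
    print(coords)
```

In this toy frame the bright object occupies exactly one grid cell, so its neighbors predict an empty patch and its error dominates the map. In the method the abstract describes, the reconstruction would come from the pre-trained model rather than a neighbor average, and the selected patches would then be passed to the agent's attention module.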