LG, CV, RO | Apr 11, 2022

Evaluating Vision Transformer Methods for Deep Reinforcement Learning from Pixels

Tianxin Tao, Daniele Reda, Michiel van de Panne

arXiv:2204.04905v2 | 14.621 citations | h-index: 57

Originality: Synthesis-oriented

AI Analysis

This work examines whether Vision Transformers are applicable to reinforcement learning for control tasks. It reports incremental gains from auxiliary tasks but no breakthrough over existing CNN-based methods.

The paper evaluates Vision Transformer (ViT) methods for deep reinforcement learning from pixels against a leading CNN-based method, RAD. CNNs still generally outperform ViTs. Among the ViT variants, auxiliary tasks improve on plain ViT training, and reconstruction-based tasks perform best.

Vision Transformers (ViT) have recently demonstrated the significant potential of transformer architectures for computer vision. To what extent can image-based deep reinforcement learning also benefit from ViT architectures, as compared to standard convolutional neural network (CNN) architectures? To answer this question, we evaluate ViT training methods for image-based reinforcement learning (RL) control tasks and compare these results to a leading convolutional-network architecture method, RAD. For training the ViT encoder, we consider several recently-proposed self-supervised losses that are treated as auxiliary tasks, as well as a baseline with no additional loss terms. We find that the CNN architectures trained using RAD still generally provide superior performance. For the ViT methods, all three types of auxiliary tasks that we consider provide a benefit over plain ViT training. Furthermore, ViT reconstruction-based tasks are found to significantly outperform ViT contrastive-learning.
