Semantic Exploration from Language Abstractions and Pretrained Representations
This work addresses the difficulty of exploration for reinforcement learning agents in complex, continuous 3D environments by defining novelty over language-shaped state representations, an approach that could benefit a range of exploration algorithms.
The paper tackles exploration in high-dimensional reinforcement learning environments by defining novelty over semantically meaningful state abstractions. These abstractions are derived from vision-language representations pretrained on natural image captioning datasets, and they improve performance on 3D simulated environments.
Effective exploration is a challenge in reinforcement learning (RL). Novelty-based exploration methods can suffer in high-dimensional state spaces, such as continuous, partially observable 3D environments. We address this challenge by defining novelty using semantically meaningful state abstractions, which can be found in learned representations shaped by natural language. In particular, we evaluate vision-language representations pretrained on natural image captioning datasets. We show that these pretrained representations drive meaningful, task-relevant exploration and improve performance on 3D simulated environments. We also characterize why and how language provides useful abstractions for exploration by considering the impacts of using representations from a pretrained model, a language oracle, and several ablations. We demonstrate the benefits of our approach in two very different task domains -- one that stresses the identification and manipulation of everyday objects, and one that requires navigational exploration in an expansive world. Our results suggest that using language-shaped representations could improve exploration for various algorithms and agents in challenging environments.
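To make the idea of defining novelty over a semantic abstraction concrete, here is a minimal sketch. It is not the paper's implementation: the abstract does not specify how novelty is computed, so the episodic nearest-neighbour bonus, the class `EpisodicNovelty`, the function `toy_embed`, and the vocabulary `VOCAB` are all hypothetical names chosen for illustration. In practice `toy_embed` would be replaced by a frozen pretrained vision-language image encoder.

```python
import math
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]


def cosine_distance(a: Vector, b: Vector) -> float:
    """Return 1 - cosine similarity; zero-norm inputs are treated as maximally distant."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


class EpisodicNovelty:
    """Assumed novelty rule: mean distance to the k nearest embeddings seen this episode."""

    def __init__(self, embed: Callable[[Dict], Vector], k: int = 3) -> None:
        self.embed = embed  # stand-in for a frozen pretrained encoder
        self.k = k
        self.memory: List[List[float]] = []

    def reset(self) -> None:
        """Clear the memory at the start of a new episode."""
        self.memory.clear()

    def reward(self, observation: Dict) -> float:
        """Score the observation against memory, then store its embedding."""
        z = list(self.embed(observation))
        if not self.memory:
            bonus = 1.0  # the first observation of an episode is fully novel
        else:
            distances = sorted(cosine_distance(z, m) for m in self.memory)
            nearest = distances[: self.k]
            bonus = sum(nearest) / len(nearest)
        self.memory.append(z)
        return bonus


# Hypothetical vocabulary and toy encoder: the embedding depends only on which
# objects are visible and ignores low-level variation such as pixel noise.
VOCAB = ("chair", "table", "lamp")


def toy_embed(observation: Dict) -> List[float]:
    visible = observation["objects"]
    return [1.0 if word in visible else 0.0 for word in VOCAB]


if __name__ == "__main__":
    novelty = EpisodicNovelty(toy_embed, k=1)
    print(novelty.reward({"objects": ["chair"], "noise": 0.1}))
    print(novelty.reward({"objects": ["chair"], "noise": 0.9}))
    print(novelty.reward({"objects": ["table"], "noise": 0.1}))
```

In this toy setup, two frames that show the same object but differ in low-level noise map to the same embedding, so the second one earns no bonus, while a frame showing a new object earns the full bonus. That is the intuition behind measuring novelty in a semantic space rather than in raw observation space, where almost every frame in a continuous 3D world would look new.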