Vision-Language Models Provide Promptable Representations for Reinforcement Learning
The paper addresses slow, from-scratch learning in reinforcement learning for embodied agents by integrating pre-trained vision-language models into policy learning. The integration itself is new, but the contribution is incremental in that it applies existing VLMs to RL.
The paper tackles the challenge of enabling reinforcement learning agents to leverage background world knowledge for faster learning, using vision-language models as promptable representations. The resulting policies outperform policies trained on generic image embeddings as well as instruction-following methods, and chain-of-thought prompting improves performance in novel scenes by 1.5x.
Humans can quickly learn new behaviors by leveraging background world knowledge. In contrast, agents trained with reinforcement learning (RL) typically learn behaviors from scratch. We thus propose a novel approach that uses the vast amounts of general and indexable world knowledge encoded in vision-language models (VLMs) pre-trained on Internet-scale data for embodied RL. We initialize policies with VLMs by using them as promptable representations: embeddings that encode semantic features of visual observations based on the VLM's internal knowledge and reasoning capabilities, as elicited through prompts that provide task context and auxiliary information. We evaluate our approach on visually complex, long-horizon RL tasks in Minecraft and robot navigation in Habitat. We find that our policies trained on embeddings from off-the-shelf, general-purpose VLMs outperform equivalent policies trained on generic, non-promptable image embeddings. We also find that our approach outperforms instruction-following methods and performs comparably to domain-specific embeddings. Finally, we show that our approach can use chain-of-thought prompting to produce representations of common-sense semantic reasoning, improving policy performance in novel scenes by 1.5 times.
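The idea of a promptable representation can be illustrated with a minimal sketch. Nothing here is the paper's implementation: `toy_vlm_embed` is a hypothetical stand-in for a frozen VLM (it derives a random projection from the prompt text), `LinearPolicy` is an assumed, untrained policy head, and the prompts and dimensions are made up for illustration. The only point is the data flow: the same observation yields different embeddings under different prompts, and the policy consumes the embedding rather than raw pixels.

```python
import hashlib

import numpy as np

EMBED_DIM = 16    # assumed embedding size, for illustration only
NUM_ACTIONS = 4   # assumed size of a discrete action space


def toy_vlm_embed(image: np.ndarray, prompt: str) -> np.ndarray:
    """Hypothetical stand-in for a frozen VLM: (image, prompt) -> embedding.

    A real VLM would run the image and prompt through a pre-trained model;
    here the prompt only seeds a random projection so that the embedding
    depends on both inputs.
    """
    digest = hashlib.sha256(prompt.encode("utf-8")).digest()
    seed = int.from_bytes(digest[:4], "little")
    rng = np.random.default_rng(seed)
    projection = rng.standard_normal((EMBED_DIM, image.size))
    return np.tanh(projection @ image.ravel() / np.sqrt(image.size))


class LinearPolicy:
    """Assumed policy head: a linear layer plus softmax over actions."""

    def __init__(self, embed_dim: int, num_actions: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.weights = 0.1 * rng.standard_normal((num_actions, embed_dim))
        self.bias = np.zeros(num_actions)

    def action_probs(self, embedding: np.ndarray) -> np.ndarray:
        logits = self.weights @ embedding + self.bias
        logits = logits - logits.max()  # subtract max for numerical stability
        exp = np.exp(logits)
        return exp / exp.sum()


# A random array standing in for a visual observation.
observation = np.random.default_rng(42).random((8, 8, 3))

# Illustrative prompts carrying task context (hypothetical wording).
task_prompt = "Is there a spider in this image?"
other_prompt = "Is there a cow in this image?"

emb_task = toy_vlm_embed(observation, task_prompt)
emb_other = toy_vlm_embed(observation, other_prompt)

policy = LinearPolicy(EMBED_DIM, NUM_ACTIONS)
probs = policy.action_probs(emb_task)
```

In an actual RL setup the VLM would stay frozen and only the policy head would be trained on these embeddings, so the choice of prompt determines which features of the observation the policy gets to see.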