LG AI · Feb 26, 2025

Distilling Reinforcement Learning Algorithms for In-Context Model-Based Planning

Georgia Tech

arXiv:2502.19009v1 · 20.512 citations · h-index: 7 · Has Code · ICLR

Originality: Highly original

AI Analysis

This paper addresses sample efficiency and suboptimal behavior in in-context RL. It is incremental in that it builds on existing Transformer-based RL methods.

The paper tackles the problem of Transformers inheriting suboptimal behaviors from the RL algorithms they imitate in-context. It proposes DICP, a framework that integrates model-based planning to improve the policy without a separate dynamics model, and it reports state-of-the-art performance with fewer environment interactions than baselines.

Recent studies have shown that Transformers can perform in-context reinforcement learning (RL) by imitating existing RL algorithms, enabling sample-efficient adaptation to unseen tasks without parameter updates. However, these models also inherit the suboptimal behaviors of the RL algorithms they imitate. This issue primarily arises due to the gradual update rule employed by those algorithms. Model-based planning offers a promising solution to this limitation by allowing the models to simulate potential outcomes before taking action, providing an additional mechanism to deviate from the suboptimal behavior. Rather than learning a separate dynamics model, we propose Distillation for In-Context Planning (DICP), an in-context model-based RL framework where Transformers simultaneously learn environment dynamics and improve policy in-context. We evaluate DICP across a range of discrete and continuous environments, including Darkroom variants and Meta-World. Our results show that DICP achieves state-of-the-art performance while requiring significantly fewer environment interactions than baselines, which include both model-free counterparts and existing meta-RL methods.
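
The planning idea in the abstract (simulate potential outcomes in-context before acting) can be illustrated with a minimal Python sketch. This is not the authors' implementation: `ToyInContextModel`, `env_step`, `plan`, the 1D chain environment, and the exhaustive search over short action sequences are all assumptions made for illustration. A lookup table stands in for the Transformer's dynamics predictions, which would in practice be conditioned on the interaction history.

```python
from itertools import product

N_STATES, GOAL = 6, 4      # assumed toy chain: states 0..5, reward at state 4
ACTIONS = (-1, 0, 1)       # move left, stay, move right

def env_step(state, action):
    """True environment, used only to collect context transitions."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    return next_state, (1.0 if next_state == GOAL else 0.0)

class ToyInContextModel:
    """Hypothetical stand-in for the sequence model's dynamics predictions.

    It predicts (next_state, reward) purely from transitions stored in its
    context; nothing is trained and no parameters are updated.
    """
    def __init__(self):
        self.context = {}  # (state, action) -> (next_state, reward)

    def observe(self, state, action, next_state, reward):
        self.context[(state, action)] = (next_state, reward)

    def predict(self, state, action):
        # Unseen pairs fall back to "nothing happens" (an assumption).
        return self.context.get((state, action), (state, 0.0))

def plan(model, state, horizon=2):
    """Return the first action of the best imagined action sequence."""
    best_return, best_action = float("-inf"), None
    for seq in product(ACTIONS, repeat=horizon):
        s, total = state, 0.0
        for a in seq:                  # roll out inside the model, not the env
            s, r = model.predict(s, a)
            total += r
        if total > best_return:        # ties keep the earliest sequence
            best_return, best_action = total, seq[0]
    return best_action

model = ToyInContextModel()
for s in range(N_STATES):              # fill the context with real experience
    for a in ACTIONS:
        model.observe(s, a, *env_step(s, a))
```

With the context filled, `plan(model, 2)` returns `1` (move right toward the goal), and no environment step is spent on evaluating the candidates. With an empty context every imagined return is zero, so the planner has nothing to prefer; in this toy, planning quality depends entirely on how well the context covers the dynamics.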
