LG | Feb 5, 2024

Fine-tuning Reinforcement Learning Models is Secretly a Forgetting Mitigation Problem

Maciej Wołczyk, Bartłomiej Cupiał, Mateusz Ostaszewski, Michał Bortkiewicz, Michał Zając, Razvan Pascanu, Łukasz Kuciński, Piotr Miłoś

DeepMind

arXiv:2402.02868v322.730 citations | h-index: 60 | Has Code | ICML

Originality: Incremental advance

AI Analysis

This work addresses a critical problem for reinforcement learning practitioners by enabling more effective transfer of pre-trained models, though it is incremental in that it applies existing techniques to a specific setting.

The paper identifies forgetting of pre-trained capabilities as a key cause of poor fine-tuning in reinforcement learning, and shows that standard knowledge retention techniques mitigate this issue, achieving a new state-of-the-art score in NetHack by improving from 5K to over 10K points.

Fine-tuning is a widespread technique that allows practitioners to transfer pre-trained capabilities, as recently showcased by the successful applications of foundation models. However, fine-tuning reinforcement learning (RL) models remains a challenge. This work conceptualizes one specific cause of poor transfer, accentuated in the RL setting by the interplay between actions and observations: forgetting of pre-trained capabilities. Namely, a model deteriorates on the state subspace of the downstream task not visited in the initial phase of fine-tuning, on which the model behaved well due to pre-training. This way, we lose the anticipated transfer benefits. We identify conditions when this problem occurs, showing that it is common and, in many cases, catastrophic. Through a detailed empirical analysis of the challenging NetHack and Montezuma's Revenge environments, we show that standard knowledge retention techniques mitigate the problem and thus allow us to take full advantage of the pre-trained capabilities. In particular, in NetHack, we achieve a new state-of-the-art for neural models, improving the previous best score from $5$K to over $10$K points in the Human Monk scenario.
