LG · CL · Feb 20, 2024

Reflect-RL: Two-Player Online RL Fine-Tuning for LMs

Tsinghua
arXiv:2402.12621v2 · 34 citations · h-index: 7 · Has Code · ACL
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of improving language model performance in complex interactive environments. It appears incremental because it builds on existing RL and reflection methods.

The paper tackles fine-tuning language models for multi-round interactive tasks by proposing Reflect-RL, a two-player online reinforcement learning system in which a frozen reflection model assists the policy model. Reflect-RL outperforms both supervised fine-tuning and online RL without reflection, and GPT-2 XL 1.56B fine-tuned with Reflect-RL outperforms larger open-source models such as Mistral 7B.

As language models (LMs) demonstrate their capabilities in various fields, their application to tasks requiring multi-round interactions has become increasingly popular. These tasks usually have complex dynamics, so supervised fine-tuning (SFT) on a limited offline dataset does not yield good performance. However, only a few works have attempted to train LMs directly within interactive decision-making environments. We aim to create an effective approach to fine-tune LMs with online reinforcement learning (RL) in these environments. We propose Reflect-RL, a two-player system to fine-tune an LM using SFT and online RL, where a frozen reflection model (player) assists the policy model (player). To generate data for the warm-up SFT stage, we use negative example generation to enhance the error-correction ability of the reflection model. Furthermore, we design single-prompt action enumeration and apply curriculum learning to allow the policy model to learn more efficiently. Empirically, we verify that Reflect-RL outperforms SFT and online RL without reflection. Test results indicate that GPT-2 XL 1.56B fine-tuned with Reflect-RL outperforms larger open-source LMs, such as Mistral 7B. The benchmarks, dataset, and code involved in this work are publicly available: https://github.com/zhourunlong/Reflect-RL.
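
The abstract names two mechanisms without detailing them: a frozen reflection player that assists the policy player, and single-prompt action enumeration. The sketch below is one plausible reading of how a single decision step could combine them, not the authors' implementation. It assumes that "single-prompt action enumeration" means listing every candidate action with an index in one prompt so the policy scores all of them at once. The names `build_prompt`, `act`, `reflect`, and `score_actions` are hypothetical, and the two models are represented by plain callables rather than real LMs.

```python
import math
import random
from typing import Callable, List, Tuple


def build_prompt(observation: str, reflection: str, actions: List[str]) -> str:
    """Put the observation, the reflection, and every indexed action in one prompt."""
    lines = [f"Observation: {observation}", f"Reflection: {reflection}", "Actions:"]
    lines += [f"{i}. {a}" for i, a in enumerate(actions)]
    lines.append("Choice:")
    return "\n".join(lines)


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over one score per enumerated action."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def act(
    observation: str,
    actions: List[str],
    reflect: Callable[[str], str],                      # frozen reflection player (assumed interface)
    score_actions: Callable[[str, int], List[float]],   # policy player: prompt -> one logit per action
    rng: random.Random,
) -> Tuple[int, float, str]:
    """One decision step: reflect, enumerate actions in a single prompt, sample an action."""
    reflection = reflect(observation)          # the reflection model is not updated during RL
    prompt = build_prompt(observation, reflection, actions)
    probs = softmax(score_actions(prompt, len(actions)))
    idx = rng.choices(range(len(actions)), weights=probs, k=1)[0]
    # The log-probability is what an online RL update on the policy would consume.
    return idx, math.log(probs[idx]), prompt


if __name__ == "__main__":
    # Toy stand-ins for the two players; a real setup would call LMs here.
    toy_reflect = lambda obs: "The door is locked, so look for a key first."
    toy_policy = lambda prompt, n: [0.0] * n   # uniform logits
    idx, logp, prompt = act(
        "You are in a room with a locked door.",
        ["open door", "take key", "go north"],
        toy_reflect,
        toy_policy,
        random.Random(0),
    )
    print(prompt)
    print("chosen action index:", idx)
```

In this reading, only `score_actions` would receive gradient updates during online RL, while `reflect` stays fixed after the warm-up SFT stage described in the abstract.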

Code Implementations: 1 repo
