LG, AI | Jan 21

Rethinking Reinforcement fine-tuning of LLMs: A Multi-armed Bandit Learning Perspective

Xiao Hu, Hong Xie, Tao Tan, Defu Lian, Jianyu Han

arXiv:2601.14599v1 | 1.4 | h-index: 8

Originality: Incremental advance

AI Analysis

This work addresses fundamental questions about the optimization choices used in reinforcement fine-tuning of LLMs. The advance is incremental: it builds on existing heuristics and aims to give a clearer understanding of them.

The paper tackles inconsistent claims and limited understanding in reinforcement fine-tuning of LLMs by proposing a bottom-up experiment pipeline that examines the role of each design choice. It reports new insights from experiments on three LLMs and two reasoning datasets.

A large number of heuristics have been proposed to optimize the reinforcement fine-tuning of LLMs. However, inconsistent claims are made from time to time, which makes the area hard to pin down. Two fundamental questions still lack a clear answer: 1) what is the role of each optimization choice? 2) which choices are the bottlenecks? This paper aims to shed light on both questions, and it faces the challenge of several entangled confounding factors in the fine-tuning process. To tackle this challenge, we propose a bottom-up experiment pipeline. The bottom layer is a minimalist configuration: a single training example, one rollout per round, and the reward serving directly as the learning signal, with no advantage function design. This minimalist configuration connects to multi-armed bandit learning with an extremely large discrete action space, which offers theory to corroborate the experimental findings. The upward procedure of the pipeline expands the minimalist configuration layer by layer, examining the role of each design choice. Experimental results on three LLMs and two reasoning datasets not only reveal a new understanding of these design choices but also yield essential insights to shape the area.
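The minimalist configuration described in the abstract can be illustrated with a toy bandit. The sketch below is not the paper's setup: it assumes a small tabular softmax policy over a handful of arms (a stand-in for the extremely large space of LLM responses to one prompt), Bernoulli rewards, and a plain REINFORCE update in which the raw reward multiplies the log-probability gradient, with no baseline or advantage. The names `softmax`, `train_bandit`, and `reward_probs` are hypothetical and chosen only for this illustration.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_bandit(reward_probs, rounds=500, lr=0.5, seed=0):
    """Toy analogue of the minimalist configuration (an assumption, not the paper's code).

    - one "prompt": a single bandit instance with len(reward_probs) arms
    - one rollout per round: exactly one arm is sampled each round
    - raw reward as the learning signal: no baseline, no advantage
    """
    rng = random.Random(seed)
    k = len(reward_probs)
    logits = [0.0] * k  # uniform initial policy

    for _ in range(rounds):
        probs = softmax(logits)
        # Single rollout: sample one arm from the current policy.
        arm = rng.choices(range(k), weights=probs)[0]
        # Bernoulli reward for the sampled arm.
        reward = 1.0 if rng.random() < reward_probs[arm] else 0.0
        # REINFORCE: d/d logit_i of log pi(arm) = 1[i == arm] - probs[i].
        for i in range(k):
            grad_log_pi = (1.0 if i == arm else 0.0) - probs[i]
            logits[i] += lr * reward * grad_log_pi

    return softmax(logits)


if __name__ == "__main__":
    # Only the middle arm is ever rewarded, so the policy should concentrate on it.
    print(train_bandit([0.0, 1.0, 0.0]))
```

Because the reward is used directly, a round with zero reward produces no update at all, which is one concrete way to see why reward sparsity matters in this bottom layer. Adding more prompts, more rollouts per round, or a baseline would correspond to the layer-by-layer expansion the abstract describes.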
