AIOct 30, 2025

BOTS: A Unified Framework for Bayesian Online Task Selection in LLM Reinforcement Finetuning

Qianli Shen, Daoyuan Chen, Yilun Huang, Zhenqing Ling, Yaliang Li, Bolin Ding, Jingren Zhou

arXiv:2510.26374v28 citationsh-index: 18

Originality Incremental advance

AI Analysis

This addresses a domain-specific problem for LLM alignment practitioners by offering a more efficient and adaptive task selection method, though it is incremental as it builds on existing Bayesian and sampling techniques.

The paper tackles the inefficiency of task selection in reinforcement finetuning for LLMs by introducing BOTS, a Bayesian framework that adaptively estimates task difficulty and balances exploration and exploitation, resulting in improved data efficiency and performance across diverse domains and model scales.

Reinforcement finetuning (RFT) is a key technique for aligning Large Language Models (LLMs) with human preferences and enhancing reasoning, yet its effectiveness is highly sensitive to which tasks are explored during training. Uniform task sampling is inefficient, wasting computation on tasks that are either trivial or unsolvable, while existing task selection methods often suffer from high rollout costs, poor adaptivity, or incomplete evidence. We introduce BOTS, a unified framework for Bayesian Online Task Selection in LLM reinforcement finetuning. Grounded in Bayesian inference, BOTS adaptively maintains posterior estimates of task difficulty as the model evolves. It jointly incorporates explicit evidence from direct evaluations of selected tasks and implicit evidence inferred from these evaluations for unselected tasks, with Thompson sampling ensuring a principled balance between exploration and exploitation. To make implicit evidence practical, we instantiate it with an ultra-light interpolation-based plug-in that estimates difficulties of unevaluated tasks without extra rollouts, adding negligible overhead. Empirically, across diverse domains and LLM scales, BOTS consistently improves data efficiency and performance over baselines and ablations, providing a practical and extensible solution for dynamic task selection in RFT.

View on arXiv PDF

Similar