What is the objective of reasoning with reinforcement learning?
This work gives a unified theoretical view of RL algorithms for LLMs. The contribution is incremental, but it clarifies the underlying mechanisms for researchers in AI and machine learning.
The paper shows that popular reinforcement learning algorithms for large language models with binary rewards can be interpreted as stochastic gradient ascent on a monotone transform of the probability of a correct answer. Rejection sampling corresponds to the logarithm, and GRPO to the arcsine of the square root.
We show that several popular algorithms for reinforcement learning in large language models with binary rewards can be viewed as stochastic gradient ascent on a monotone transform of the probability of a correct answer given a prompt. In particular, the transformation associated with rejection sampling algorithms is the logarithm and that associated with the GRPO algorithm is the arcsine of the square root.
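
As a rough illustration of what the choice of transform changes, the sketch below compares the derivatives of the two transforms named above. By the chain rule, ascending f(p) scales the gradient of p by f'(p), so f'(p) acts as a per-prompt weight on the gradient of the success probability. The function names are hypothetical and not taken from the paper; the code only checks the calculus numerically and does not implement either algorithm.

```python
import math


def log_transform(p):
    # Transform the abstract associates with rejection sampling.
    return math.log(p)


def arcsine_sqrt_transform(p):
    # Transform the abstract associates with GRPO.
    return math.asin(math.sqrt(p))


def log_weight(p):
    # Closed-form derivative of log(p).
    return 1.0 / p


def arcsine_sqrt_weight(p):
    # Closed-form derivative of arcsin(sqrt(p)).
    return 1.0 / (2.0 * math.sqrt(p * (1.0 - p)))


def numerical_derivative(f, p, eps=1e-6):
    # Central finite difference, used only to sanity-check the closed forms.
    return (f(p + eps) - f(p - eps)) / (2.0 * eps)


if __name__ == "__main__":
    # p is an assumed probability of a correct answer for one prompt.
    for p in (0.1, 0.5, 0.9):
        print(
            f"p={p:.1f}  "
            f"log weight={log_weight(p):.4f}  "
            f"arcsine-sqrt weight={arcsine_sqrt_weight(p):.4f}"
        )
```

Read this way, the logarithm puts the most weight on prompts with small p, while the arcsine of the square root puts extra weight on prompts with p near 0 or near 1.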