LG · OC · Oct 15, 2025

What is the objective of reasoning with reinforcement learning?

arXiv:2510.13651v1 · 13 citations · h-index: 1
Originality: Synthesis-oriented
AI Analysis

This work gives a unified theoretical account of RL algorithms for LLMs. The contribution is incremental, but it clarifies the underlying mechanisms for researchers in AI and machine learning.

The paper demonstrates that popular reinforcement learning algorithms for large language models with binary rewards can be interpreted as stochastic gradient ascent on transformed probabilities of correct answers, specifically linking rejection sampling to logarithmic transformations and GRPO to arcsine-square-root transformations.

We show that several popular algorithms for reinforcement learning in large language models with binary rewards can be viewed as stochastic gradient ascent on a monotone transform of the probability of a correct answer given a prompt. In particular, the transformation associated with rejection sampling algorithms is the logarithm and that associated with the GRPO algorithm is the arcsine of the square root.
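One way to read this claim: by the chain rule, gradient ascent on a transform h(p) of the success probability p weights the gradient of p by h'(p). The sketch below numerically checks the two derivatives, 1/p for the logarithm and 1/(2·sqrt(p(1-p))) for the arcsine of the square root. The function names are illustrative and do not come from the paper. Linking the second weight to GRPO's division by the reward standard deviation, which is sqrt(p(1-p)) for a binary reward, is an interpretation assumed here rather than a statement quoted from the abstract.

```python
import math


def log_transform(p):
    """Transform the abstract associates with rejection sampling."""
    return math.log(p)


def arcsine_sqrt_transform(p):
    """Transform the abstract associates with GRPO."""
    return math.asin(math.sqrt(p))


def log_weight(p):
    """Analytic derivative of log(p)."""
    return 1.0 / p


def arcsine_sqrt_weight(p):
    """Analytic derivative of arcsin(sqrt(p))."""
    return 1.0 / (2.0 * math.sqrt(p * (1.0 - p)))


def numerical_derivative(f, p, eps=1e-6):
    """Central finite difference, used only to check the analytic forms."""
    return (f(p + eps) - f(p - eps)) / (2.0 * eps)


if __name__ == "__main__":
    # Compare the finite-difference and analytic weights at a few values of p.
    for p in (0.1, 0.5, 0.9):
        print(
            f"p={p:.1f}  "
            f"log: {numerical_derivative(log_transform, p):.4f} "
            f"vs {log_weight(p):.4f}  "
            f"arcsin-sqrt: {numerical_derivative(arcsine_sqrt_transform, p):.4f} "
            f"vs {arcsine_sqrt_weight(p):.4f}"
        )
```

Both weights grow as p approaches 0, so hard prompts are emphasized under either transform. Only the arcsine-square-root weight also grows as p approaches 1.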

