LG · Oct 23, 2025

KL-Regularized Reinforcement Learning is Designed to Mode Collapse

Anthony GX-Chen, Jatin Prakash, Jeff Guo, Rob Fergus, Rajesh Ranganath

arXiv:2510.20817v123.918 citations · h-index: 5

Originality: Incremental advance

AI Analysis

This work addresses mode collapse in KL-regularized reinforcement learning, a critical issue wherever diverse outputs are needed, as in language and chemical modeling. The advance is rated incremental because it builds on existing regularization frameworks.

The paper challenges the common belief that reverse KL regularization in reinforcement learning leads to mode seeking and forward KL to mass covering, showing that both can cause mode collapse depending on regularization strength and reward scaling. It introduces a simple algorithm that modifies reward magnitudes to optimize for diverse, high-quality target distributions, achieving improved solution quality and diversity in Large Language Models and Chemical Language Models without external diversity signals.

It is commonly believed that optimizing the reverse KL divergence results in "mode seeking", while optimizing forward KL results in "mass covering", with the latter being preferred if the goal is to sample from multiple diverse modes. We show -- mathematically and empirically -- that this intuition does not necessarily transfer well to doing reinforcement learning with reverse/forward KL regularization (e.g. as commonly used with language models). Instead, the choice of reverse/forward KL determines the family of optimal target distributions, parameterized by the regularization coefficient. Mode coverage depends primarily on other factors, such as regularization strength, and relative scales between rewards and reference probabilities. Further, we show commonly used settings such as low regularization strength and equal verifiable rewards tend to specify unimodal target distributions, meaning the optimization objective is, by construction, non-diverse. We leverage these insights to construct a simple, scalable, and theoretically justified algorithm. It makes minimal changes to reward magnitudes, yet optimizes for a target distribution which puts high probability over all high-quality sampling modes. In experiments, this simple modification works to post-train both Large Language Models and Chemical Language Models to have higher solution quality and diversity, without any external signals of diversity, and works with both forward and reverse KL when using either naively fails.
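The claim that equal rewards and low regularization strength specify a unimodal target can be checked on a toy problem. The sketch below is not the paper's algorithm; it only evaluates the standard closed-form maximizer of the reverse-KL-regularized objective E_pi[r] - beta * KL(pi || pi_ref) over a finite answer set, which is pi*(y) proportional to pi_ref(y) * exp(r(y) / beta). The function name `reverse_kl_target`, the three candidate answers, and all probabilities and rewards are hypothetical values chosen for illustration.

```python
import math


def reverse_kl_target(ref_probs, rewards, beta):
    """Closed-form target for max_pi E_pi[r] - beta * KL(pi || pi_ref)
    over a finite set: pi*(y) is proportional to pi_ref(y) * exp(r(y) / beta)."""
    # Work in log space and subtract the max for numerical stability.
    logits = [math.log(p) + r / beta for p, r in zip(ref_probs, rewards)]
    top = max(logits)
    weights = [math.exp(x - top) for x in logits]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical toy problem (assumed numbers, not from the paper):
# answer 0 is correct and common under the reference policy,
# answer 1 is correct but rare, answer 2 is incorrect.
ref_probs = [0.6, 0.006, 0.394]
rewards = [1.0, 1.0, 0.0]  # equal verifiable rewards for both correct answers

for beta in (1.0, 0.1, 0.01):
    target = reverse_kl_target(ref_probs, rewards, beta)
    # With equal rewards the exp(r / beta) factors cancel, so the ratio between
    # the two correct answers is fixed at pi_ref(0) / pi_ref(1) for every beta.
    print(beta, [round(p, 4) for p in target], round(target[0] / target[1], 2))

# Equal reference probabilities but slightly unequal rewards: a small beta
# concentrates nearly all target mass on the single highest-reward answer.
print(reverse_kl_target([0.5, 0.5], [1.0, 0.9], beta=0.01))
```

In this toy setting, lowering beta removes mass from the incorrect answer but never rebalances the two correct ones, so the rare correct answer stays a factor of pi_ref(0) / pi_ref(1) below the common one. That is one way to read the abstract's statement that the objective can be non-diverse by construction, and why changing reward magnitudes, rather than the KL direction, is the lever that alters the target.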
