LG · ML · May 4

Generalized Distributional Alignment Games for Unbiased Answer-Level Fine-Tuning

Mehryar Mohri, Jon Schneider, Yutao Zhong

arXiv:2605.02435

AI Analysis

For practitioners of reinforcement learning from human feedback, this provides a principled fix to a known training instability, though the improvement is incremental over existing ALFT methods.

The paper resolves a systematic estimation bias in Distributional Alignment Games for Answer-Level Fine-Tuning by introducing unbiased estimators using U-statistics for polynomial rewards and a minimax optimal estimator for KL divergence, achieving a bias of Θ(1/K^2) and accelerated convergence with zero online overhead.

The Distributional Alignment Game framework provides a powerful variational perspective on Answer-Level Fine-Tuning (ALFT). However, standard algorithms for these games rely on estimating logarithmic rewards from small batches, introducing a systematic bias due to Jensen's inequality that can destabilize training. In this paper, we systematically resolve this structural estimation bias. First, we generalize the alignment game to arbitrary Bregman divergences, showing that for a family of geometries inducing polynomial rewards, we can construct provably exact and unbiased estimators using U-statistics. Second, for the canonical KL divergence game where an exact solution is impossible, we derive a globally robust minimax polynomial estimator that is provably optimal, achieving the fundamental statistical error limit of $\Theta(1/K^2)$, which we establish via the Ditzian-Totik theorem. Finally, we synthesize these two approaches to propose a novel Variance-Optimal Augmented Polynomial Optimization Program (AQP) Estimator, proving that by systematically reducing variance, our method achieves not only optimal bias but also provably accelerated game convergence, leading to more efficient and stable training with zero online computational overhead.
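The two bias claims in the abstract can be illustrated with a small textbook computation. The sketch below is not the paper's estimator. It assumes a simplified setting in which each of K sampled answers is scored as correct or incorrect with unknown probability p, and it computes expectations exactly by summing over the binomial distribution. The function names, the clipping constant `eps`, and the values of `K`, `p`, and `m` are all hypothetical choices made for illustration.

```python
from math import comb, log


def binom_pmf(k, s, p):
    """Probability of s successes in k independent Bernoulli(p) draws."""
    return comb(k, s) * p**s * (1 - p) ** (k - s)


def expectation(estimator, k, p):
    """Exact expectation of estimator(s) when s ~ Binomial(k, p)."""
    return sum(binom_pmf(k, s, p) * estimator(s) for s in range(k + 1))


def u_stat_power(s, k, m):
    """U-statistic estimate of p**m: the fraction of size-m subsets of the
    k draws that are all successes. Unbiased whenever m <= k."""
    return comb(s, m) / comb(k, m)


def plugin_log(s, k, eps=1e-3):
    """Plug-in estimate of log(p). The clip at eps is an arbitrary choice
    made only to avoid log(0) when no draw succeeds."""
    return log(max(s / k, eps))


# Hypothetical batch size, success probability, and polynomial degree.
K, p, m = 8, 0.3, 2

exp_u = expectation(lambda s: u_stat_power(s, K, m), K, p)
exp_plugin_power = expectation(lambda s: (s / K) ** m, K, p)
exp_plugin_log = expectation(lambda s: plugin_log(s, K), K, p)

print(f"target p^m             : {p**m:.6f}")
print(f"E[U-statistic]         : {exp_u:.6f}")
print(f"E[plug-in (s/K)^m]     : {exp_plugin_power:.6f}")
print(f"target log p           : {log(p):.6f}")
print(f"E[plug-in log estimate]: {exp_plugin_log:.6f}")
```

In this toy setting the U-statistic matches the polynomial target in expectation, while the naive plug-in power overshoots it by p(1-p)/K and the plug-in logarithm falls below log p, as Jensen's inequality predicts for a concave function. No polynomial in the samples can remove the logarithm's bias exactly, which is consistent with the abstract's statement that the KL case requires an approximate, minimax-optimal estimator instead.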
