LG CL Oct 27, 2025

GIFT: Group-relative Implicit Fine Tuning Integrates GRPO with DPO and UNA

arXiv:2510.23868v1 · 1 citation

Originality: Incremental advance

AI Analysis

This work addresses alignment challenges in LLMs for AI safety and performance, offering an incremental improvement by integrating existing methods into a more efficient framework.

The paper tackles the problem of aligning large language models by proposing GIFT, a reinforcement learning framework that minimizes the discrepancy between implicit and explicit reward models. It reports superior reasoning and alignment performance on mathematical benchmarks, with faster convergence and better generalization.

I propose Group-relative Implicit Fine Tuning (GIFT), a novel reinforcement learning framework for aligning LLMs. Instead of directly maximizing cumulative rewards like PPO or GRPO, GIFT minimizes the discrepancy between implicit and explicit reward models. It combines three key ideas: (1) the online multi-response generation and normalization of GRPO, (2) the implicit reward formulation of DPO, and (3) the implicit-explicit reward alignment principle of UNA. By jointly normalizing the implicit and explicit rewards, GIFT eliminates an otherwise intractable term that prevents effective use of implicit rewards. This normalization transforms the complex reward maximization objective into a simple mean squared error (MSE) loss between the normalized reward functions, converting a non-convex optimization problem into a convex, stable, and analytically differentiable formulation. Unlike offline methods such as DPO and UNA, GIFT remains on-policy and thus retains exploration capability. Compared to GRPO, it requires fewer hyperparameters, converges faster, and generalizes better with significantly reduced training overfitting. Empirically, GIFT achieves superior reasoning and alignment performance on mathematical benchmarks while remaining computationally efficient.
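The normalization argument in the abstract can be illustrated with a small numerical sketch. In DPO, the implicit reward is beta * log(pi(y|x) / pi_ref(y|x)) plus a term beta * log Z(x) that depends only on the prompt x. Standardizing rewards within a group of responses to the same prompt subtracts the group mean, so any per-prompt constant drops out. The code below is an illustration under assumptions, not the paper's implementation: the function names `group_normalize` and `gift_style_loss`, the `eps` stabilizer, the use of per-group standard deviation, and the array layout (one row per prompt, one column per sampled response) are all hypothetical choices made for this example.

```python
import numpy as np


def group_normalize(x, eps=1e-8):
    """Standardize each row (one prompt's group of responses) to zero mean, unit std."""
    mean = x.mean(axis=1, keepdims=True)
    std = x.std(axis=1, keepdims=True)
    return (x - mean) / (std + eps)


def gift_style_loss(policy_logp, ref_logp, explicit_reward, beta=1.0):
    """MSE between group-normalized implicit and explicit rewards (illustrative only).

    All inputs have shape (num_prompts, responses_per_prompt).
    """
    # DPO-style implicit reward, written without the intractable beta * log Z(x) term.
    implicit = beta * (policy_logp - ref_logp)
    diff = group_normalize(implicit) - group_normalize(explicit_reward)
    return np.mean(diff ** 2)


# Toy data: 2 prompts, 4 sampled responses each (values are made up).
rng = np.random.default_rng(0)
policy_logp = rng.normal(size=(2, 4))
ref_logp = rng.normal(size=(2, 4))
rewards = np.array([[1.0, 0.0, 0.0, 1.0],
                    [0.0, 0.0, 1.0, 0.0]])

loss = gift_style_loss(policy_logp, ref_logp, rewards)

# A per-prompt constant (standing in for beta * log Z(x)) leaves the loss unchanged.
per_prompt_constant = np.array([[3.0], [-7.0]])
loss_shifted = gift_style_loss(policy_logp + per_prompt_constant, ref_logp, rewards)
```

In this sketch, `loss` and `loss_shifted` agree up to floating-point error, which is the cancellation the abstract relies on. An actual training loop would compute the same quantity in an autodiff framework and backpropagate through the policy log-probabilities; that part is omitted here.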
