AI · May 26

Cross-Entropy Games and Frost Training

arXiv:2605.27701 · 74.1 · h-index: 15
AI Analysis

This method offers a novel training signal for LLM alignment tasks, but it is validated only on a single infilling task, so the contribution reads as incremental.

Frost Training improves Monte Carlo-based policy optimization for LLM-as-a-judge tasks by exploiting reward gradients in embedding space, achieving higher maximum scores in best-of-k settings with increased training speed.

We present Frost Training, a method for improving Monte Carlo-based policy optimization for a large family of LLM-as-a-judge tasks called Cross-Entropy Games. The key idea is to exploit the gradient of the reward function in embedding space. This signal is used in the Greedy Coordinate Gradient (GCG) jailbreaking technique; we demonstrate for the first time that it can also be used to boost model training. We validate our method using GRPO training for maximum-likelihood infilling. Frost Training improves the model's ability to generate high-scoring outputs, reaching higher maximum scores in a best-of-k setting, and does so at an increased speed.
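To make the "gradient of the reward function in embedding space" concrete, here is a minimal NumPy sketch of the GCG-style signal the abstract refers to. It is not the paper's Frost Training procedure, which the abstract does not spell out. The toy reward (negative squared distance to a target embedding sequence), the embedding matrix `E`, and the helper names `token_scores` and `gcg_style_step` are all assumptions made for illustration. A real LLM-as-a-judge reward would be differentiated with autograd rather than by hand.

```python
import numpy as np

def reward(embeds, target):
    """Toy differentiable reward (assumption): negative squared distance to a target."""
    return -np.sum((embeds - target) ** 2)

def reward_grad(embeds, target):
    """Analytic gradient of the toy reward with respect to the input embeddings."""
    return -2.0 * (embeds - target)

def token_scores(tokens, E, target):
    """First-order estimate of the reward change for every single-token swap.

    Returns shape (T, V): entry [t, v] approximates
    reward(sequence with position t set to token v) - reward(current sequence).
    """
    embeds = E[tokens]                   # (T, d) embeddings of the current tokens
    g = reward_grad(embeds, target)      # (T, d) reward gradient in embedding space
    # Linearization: delta_r ≈ g_t · (E[v] - E[tokens[t]])
    return g @ E.T - np.sum(g * embeds, axis=1, keepdims=True)

def gcg_style_step(tokens, E, target, top_k=4):
    """Evaluate the top_k gradient-ranked swaps per position; keep the best exact reward."""
    scores = token_scores(tokens, E, target)
    best_tokens, best_r = tokens.copy(), reward(E[tokens], target)
    for t in range(len(tokens)):
        for v in np.argsort(scores[t])[-top_k:]:
            cand = tokens.copy()
            cand[t] = v
            r = reward(E[cand], target)  # exact re-evaluation of the candidate
            if r > best_r:
                best_tokens, best_r = cand, r
    return best_tokens, best_r

# Hypothetical toy setup: 50-token vocabulary, 8-dim embeddings, length-6 sequence.
rng = np.random.default_rng(0)
V, d, T = 50, 8, 6
E = rng.normal(size=(V, d))
target = E[rng.integers(V, size=T)]
tokens = rng.integers(V, size=T)

r0 = reward(E[tokens], target)
new_tokens, r1 = gcg_style_step(tokens, E, target)
```

The gradient only ranks candidate swaps. Each candidate is then re-scored exactly, so the step never lowers the reward. How such a signal is folded into GRPO updates is specific to the paper and is not reproduced here.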
