AI · May 12

Selective Off-Policy Reference Tuning with Plan Guidance

Duc Anh Le, Tien-Phat Nguyen, Thien Huu Nguyen, Linh Ngo Van, Trung Le

arXiv:2605.11505 · 86.3

Predicted impact: top 26% in AI · last 90 days
Originality: Incremental advance

AI Analysis

For researchers working on reinforcement learning for reasoning tasks, SORT provides a method to extract useful signals from all-wrong rollouts, improving training efficiency and performance.

SORT targets a known failure mode of reinforcement learning on reasoning tasks: training stalls on hard prompts where every sampled rollout fails. It uses plan guidance to derive selective learning signals from those failures. It improves over GRPO and guidance baselines across three backbones and eight reasoning benchmarks, with the largest gains on weaker models.

Reinforcement learning with verifiable rewards helps reasoning, but GRPO-style methods stall on hard prompts where all sampled rollouts fail. SORT adds a repair update for those failures without changing rollout generation: it derives a plan from the reference solution, compares token probabilities with and without that plan, and gives higher weight to tokens that become more predictable under plan conditioning. This turns all-wrong prompts into selective, structure-aware learning signals instead of uniform imitation. Across three backbones and eight reasoning benchmarks, SORT improves over GRPO and guidance baselines, with largest gains on weaker models.
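The weighting idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the exact weighting function or loss, so the softmax over log-probability gains, the `temperature` parameter, and all function names here are assumptions chosen for illustration. The sketch takes per-token log-probabilities of the reference solution, scored once with the plan in context and once without, and assigns larger weights to tokens whose log-probability rises under plan conditioning.

```python
import math

def is_all_wrong(rewards):
    """True when every sampled rollout failed (reward 0); hypothetical helper."""
    return len(rewards) > 0 and all(r == 0 for r in rewards)

def plan_gain_weights(logp_with_plan, logp_without_plan, temperature=1.0):
    """Per-token weights that grow with the log-prob gain from plan conditioning.

    Assumption: a softmax over the gains; the source only says such tokens
    receive "higher weight", not how the weight is computed.
    """
    gains = [w - wo for w, wo in zip(logp_with_plan, logp_without_plan)]
    if not gains:
        return []
    top = max(gains)  # subtract the max for numerical stability
    exps = [math.exp((g - top) / temperature) for g in gains]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_repair_loss(policy_logps, weights):
    """Weighted negative log-likelihood of the reference tokens (assumed loss form)."""
    return -sum(w * lp for w, lp in zip(weights, policy_logps))

# Toy example: the second token becomes much more predictable with the plan.
logp_with = [-0.1, -0.5, -2.0]
logp_without = [-0.1, -3.0, -2.1]

if is_all_wrong([0, 0, 0, 0]):
    weights = plan_gain_weights(logp_with, logp_without)
    loss = weighted_repair_loss(logp_without, weights)
```

In this toy case the second token dominates the weights, so the update concentrates on the step the plan makes predictable rather than imitating every reference token uniformly.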
