Prioritize the Process, Not Just the Outcome: Rewarding Latent Thought Trajectories Improves Reasoning in Looped Language Models
This work addresses inefficient credit assignment in reinforcement learning for LoopLMs. It gives researchers and practitioners in AI reasoning a method that improves performance on mathematical and non-mathematical benchmarks, though the contribution is incremental: it builds on the existing LoopLM and GRPO frameworks.
The paper tackles the problem of improving reasoning in Looped Language Models (LoopLMs). Standard reinforcement learning objectives reward only the final outcome, which is mismatched with the model's multi-step latent computation. The paper introduces RLTT, which distributes reward across the latent reasoning trajectory and improves accuracy over GRPO by +14.4% on MATH-500, +16.6% on AIME24, and +10.0% on BeyondAIME.
Looped Language Models (LoopLMs) perform multi-step latent reasoning prior to token generation and outperform conventional LLMs on reasoning benchmarks at smaller parameter budgets. However, attempts to further improve LoopLM reasoning with reinforcement learning have failed: standard objectives such as Group Relative Policy Optimization (GRPO) assign credit only to the final latent state, creating a fundamental mismatch with the model's internal computation. To resolve this, we introduce RLTT (Reward Latent Thought Trajectories), a reinforcement learning framework that distributes reward across the full latent reasoning trajectory. RLTT provides dense, trajectory-level credit assignment without relying on external verifiers and can directly replace GRPO with negligible overhead. Across extensive experiments with Ouro-2.6B-Thinking under identical training and inference conditions, RLTT yields substantial improvements over GRPO on challenging mathematical reasoning benchmarks, improving accuracy by +14.4% on MATH-500, +16.6% on AIME24, and +10.0% on BeyondAIME. Despite being trained exclusively on mathematics, RLTT also transfers effectively to non-mathematical reasoning benchmarks, demonstrating the effectiveness of trajectory-level credit assignment for reinforcement learning in LoopLMs.
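The contrast between crediting only the final latent state and crediting the whole latent trajectory can be illustrated with a toy sketch. This is not the paper's implementation: the abstract does not give the RLTT objective, so the function names, the per-loop log-probabilities, and the uniform weighting across loop iterations are all assumptions made for illustration. Only the group-relative advantage (reward standardized within a sampled group) follows the usual GRPO recipe.

```python
import math
from typing import List, Optional


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """GRPO-style advantage: standardize each reward within its sampled group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


def final_state_objective(logps_per_loop: List[float], advantage: float) -> float:
    """Credit only the log-probability produced by the last loop iteration."""
    return advantage * logps_per_loop[-1]


def trajectory_objective(
    logps_per_loop: List[float],
    advantage: float,
    weights: Optional[List[float]] = None,
) -> float:
    """Spread credit over every loop iteration's log-probability.

    Assumption: uniform weights when none are given. The weighting actually
    used by RLTT is not stated in the abstract.
    """
    n = len(logps_per_loop)
    if weights is None:
        weights = [1.0 / n] * n
    return advantage * sum(w * lp for w, lp in zip(weights, logps_per_loop))


# Hypothetical example: four sampled answers, two correct (reward 1.0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# Hypothetical log-probabilities of one sampled token after loops 1, 2, 3.
logps = [-2.0, -1.0, -0.5]

final_only = final_state_objective(logps, advantages[0])
whole_trajectory = trajectory_objective(logps, advantages[0])
```

In the final-state variant, the intermediate loop outputs contribute nothing to the objective, so no gradient signal reaches them directly. In the trajectory variant, every loop iteration receives a share of the same outcome reward, which is the sense in which credit becomes dense without an external verifier.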