Beyond Token-Level Policy Gradients for Complex Reasoning with Large Language Models
This paper addresses inefficient optimization for complex reasoning tasks in large language models and represents an incremental advance in policy gradient methods.
The paper tackles the mismatch between token-level optimization and the block-level nature of complex reasoning in language models. It proposes Multi-token Policy Gradient Optimization (MPO), which treats sequences of tokens as unified actions, and reports improved performance on mathematical reasoning and coding benchmarks compared with standard token-level methods.
Existing policy-gradient methods for auto-regressive language models typically treat each subsequent token, selected one at a time, as an action in the policy. While effective for many generation tasks, such an approach may not fully capture the structure of complex reasoning tasks, where a single semantic decision is often realized across multiple tokens, for example when defining variables or composing equations. This introduces a potential mismatch between token-level optimization and the inherently block-level nature of reasoning in these settings. To bridge this gap, we propose Multi-token Policy Gradient Optimization (MPO), a framework that treats sequences of K consecutive tokens as unified semantic actions. This block-level perspective enables our method to capture the compositional structure of reasoning trajectories and supports optimization over coherent, higher-level objectives. Experiments on mathematical reasoning and coding benchmarks show that MPO outperforms standard token-level policy gradient baselines. These results highlight the limitations of token-level policy gradients for complex reasoning and motivate future research to look beyond token-level granularity for reasoning-intensive language tasks.
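To make the idea of treating K consecutive tokens as one action concrete, the following is a minimal sketch and not the MPO objective itself, which the abstract does not specify. It assumes that the log-probability of a block is the sum of its token log-probabilities (the chain rule for an auto-regressive model) and that one advantage value is available per block. The function names `block_log_probs` and `block_policy_gradient_loss`, the handling of a shorter final block, and the plain REINFORCE-style surrogate are all illustrative assumptions.

```python
import numpy as np


def block_log_probs(token_log_probs, k):
    """Sum per-token log-probabilities over consecutive blocks of k tokens.

    Assumption: the final block is simply shorter when the sequence
    length is not a multiple of k.
    """
    token_log_probs = np.asarray(token_log_probs, dtype=float)
    if k < 1:
        raise ValueError("k must be a positive integer")
    if token_log_probs.size == 0:
        raise ValueError("token_log_probs must not be empty")
    # Start index of every block: 0, k, 2k, ...
    starts = np.arange(0, token_log_probs.size, k)
    # reduceat sums each slice [starts[i], starts[i + 1]).
    return np.add.reduceat(token_log_probs, starts)


def block_policy_gradient_loss(token_log_probs, block_advantages, k):
    """REINFORCE-style surrogate with one advantage per block (hypothetical)."""
    blocks = block_log_probs(token_log_probs, k)
    block_advantages = np.asarray(block_advantages, dtype=float)
    if blocks.shape != block_advantages.shape:
        raise ValueError("expected exactly one advantage per block")
    # Minimizing this value increases the log-probability of blocks
    # with positive advantage and decreases it for negative advantage.
    return float(-np.sum(block_advantages * blocks))


if __name__ == "__main__":
    log_probs = [-0.1, -0.2, -0.3, -0.4, -0.5]  # five tokens, made-up values
    advantages = [1.0, 0.0, -1.0]               # one value per block of K=2
    print(block_log_probs(log_probs, k=2))
    print(block_policy_gradient_loss(log_probs, advantages, k=2))
```

In this plain form, grouping tokens changes only the credit assignment: every token inside a block shares that block's advantage. A block-level method would presumably go further, for instance in how advantages or probability ratios are computed per block, but those details are not given here.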