CL · Oct 10, 2025

Token-Level Policy Optimization: Linking Group-Level Rewards to Token-Level Aggregation via Markov Likelihood

arXiv:2510.09369v1 · 4 citations · h-index: 1
Originality: Incremental advance
AI Analysis

This work addresses training stability and performance issues in reasoning for large language models, representing an incremental improvement over existing entropy-regularization methods.

The paper tackles the problem of sparse token rewards in chain-of-thought reasoning for large language models. It proposes TEPO, a token-level framework that links group-level rewards to tokens via Markov likelihood. The authors report that TEPO consistently outperforms baselines on metrics such as accuracy and sets a new state of the art on mathematical reasoning tasks.

Group Relative Policy Optimization (GRPO) has significantly advanced the reasoning ability of large language models (LLMs), particularly by boosting their mathematical performance. However, GRPO and related entropy-regularization methods still face challenges rooted in the sparse token rewards inherent to chain-of-thought (CoT). Current approaches often rely on undifferentiated token-level entropy adjustments, which frequently lead to entropy collapse or model collapse. In this work, we propose TEPO, a novel token-level framework that uses Markov Likelihood (sequence likelihood) to link group-level rewards with tokens via token-level aggregation. Experiments show that TEPO consistently outperforms existing baselines across key metrics (including @k and accuracy). It not only sets a new state of the art on mathematical reasoning tasks but also significantly enhances training stability.
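To make the ingredients named in the abstract concrete, here is a minimal sketch of the two standard pieces it builds on: GRPO-style group-relative advantages and the autoregressive (Markov) factorization of sequence likelihood. This is not the TEPO objective, which the abstract does not spell out. The function names, the population standard deviation, the toy rewards and log-probabilities, and the "broadcast the response advantage to every token" aggregation are all illustrative assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style normalization within one group of sampled responses.

    Each response's reward is centered by the group mean and scaled by the
    group standard deviation (population std here, an assumption).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def sequence_log_likelihood(token_logprobs):
    """Autoregressive factorization: log p(y|x) = sum_t log p(y_t | x, y_<t)."""
    return sum(token_logprobs)


def broadcast_to_tokens(advantage, num_tokens):
    """Hypothetical aggregation: every token shares its response's advantage."""
    return [advantage] * num_tokens


# Toy group of four sampled responses to one prompt (made-up numbers).
rewards = [1.0, 0.0, 0.0, 1.0]
token_logprobs = [
    [-0.1, -0.2],
    [-0.5, -0.3, -0.2],
    [-1.0],
    [-0.2, -0.2],
]

advantages = group_relative_advantages(rewards)
seq_ll = [sequence_log_likelihood(lp) for lp in token_logprobs]
token_adv = [
    broadcast_to_tokens(a, len(lp)) for a, lp in zip(advantages, token_logprobs)
]
```

In this sketch the reward is attached to the whole response, so every token in a response receives the same signal. That is the sparse token reward problem the abstract describes, and the sequence log-likelihood is the quantity through which a response-level reward can be tied back to per-token terms.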
