CL Oct 10, 2025

Token-Level Policy Optimization: Linking Group-Level Rewards to Token-Level Aggregation via Markov Likelihood

Xingyu Lin, Yilin Wen, En Wang, Du Su, Wenbin Liu, Chenfu Bao, Zhonghou Lv

arXiv:2510.09369v19.64 citations, h-index: 1

Originality: Incremental advance

AI Analysis

This work addresses training stability and performance issues in reasoning for large language models, representing an incremental improvement over existing entropy-regularization methods.

The paper tackles the problem of sparse token rewards in chain-of-thought reasoning for large language models. It proposes TEPO, a token-level framework that uses Markov likelihood (sequence likelihood) to link group-level rewards to individual tokens. The authors report that TEPO consistently outperforms existing baselines on key metrics, including accuracy, sets a new state of the art on mathematical reasoning tasks, and improves training stability.

Group Relative Policy Optimization (GRPO) has significantly advanced the reasoning ability of large language models (LLMs), particularly by boosting their mathematical performance. However, GRPO and related entropy-regularization methods still face challenges rooted in the sparse token rewards inherent to chain-of-thought (CoT). Current approaches often rely on undifferentiated token-level entropy adjustments, which frequently lead to entropy collapse or model collapse. In this work, we propose TEPO, a novel token-level framework that incorporates Markov Likelihood (sequence likelihood) to link group-level rewards with tokens via token-level aggregation. Experiments show that TEPO consistently outperforms existing baselines across key metrics (including @k and accuracy). It not only sets a new state of the art on mathematical reasoning tasks but also significantly enhances training stability.
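The abstract does not spell out the TEPO objective, so the following is only a minimal sketch of the two ingredients it names, under stated assumptions. It shows the standard GRPO-style group-relative advantage (rewards centered and scaled within one group of sampled responses) and the chain-rule factorization of a sequence likelihood into per-token terms. The function names and the length-normalized credit rule in `token_level_credit` are hypothetical illustrations, not the paper's method.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: center and scale rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def sequence_log_likelihood(token_logprobs):
    """Chain rule: log p(y | x) = sum over t of log p(y_t | x, y_<t)."""
    return sum(token_logprobs)


def token_level_credit(rewards, lengths):
    """ASSUMPTION (illustrative only, not TEPO's rule): spread each response's
    group-relative advantage evenly over its tokens, so the per-token credits
    of a response sum back to that response's advantage."""
    advantages = group_relative_advantages(rewards)
    return [[a / n] * n for a, n in zip(advantages, lengths)]


if __name__ == "__main__":
    # One group of four sampled responses with binary correctness rewards.
    rewards = [1.0, 0.0, 1.0, 0.0]
    lengths = [2, 3, 1, 4]  # hypothetical token counts per response
    print(group_relative_advantages(rewards))
    print(sequence_log_likelihood([math.log(0.5), math.log(0.25)]))
    print(token_level_credit(rewards, lengths))
```

The even split is chosen only because it keeps the group-level signal recoverable from the token-level terms; the paper's actual aggregation may weight tokens differently.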
