CL | AI | Oct 10, 2025

Exploring Multi-Temperature Strategies for Token- and Rollout-Level Control in RLVR

arXiv:2510.08892v1 | 6 citations | h-index: 13 | Has Code
Originality: Incremental advance
AI Analysis

This work addresses the challenge of balancing exploration and factual correctness in LLM reasoning. It is an incremental improvement over prior methods, which encourage exploration only indirectly by restricting updates.

The paper aims to improve reasoning in Large Language Models by sampling different token types at different temperatures: higher temperatures for reasoning tokens to encourage exploration, and lower temperatures for knowledge tokens to maintain factual correctness. The authors report that this approach significantly enhances reasoning performance on several benchmarks.

Reinforcement Learning has demonstrated substantial improvements in the reasoning abilities of Large Language Models (LLMs), exhibiting significant applicability across various domains. Recent research has identified that tokens within LLMs play distinct roles during reasoning tasks, categorizing them into high-entropy reasoning tokens and low-entropy knowledge tokens. Prior approaches have typically focused on restricting updates to indirectly encourage exploration, yet they do not explicitly facilitate exploratory behavior during the token generation stage itself. In this work, we introduce a complementary approach that explicitly promotes exploration during sampling by applying distinct temperature settings for different token types. Specifically, our method employs higher temperatures for reasoning tokens to actively encourage exploration, while retaining lower temperatures for knowledge tokens to maintain factual correctness. Furthermore, we systematically investigate various multi-temperature scheduling strategies and their impacts within reinforcement learning contexts. Empirical evaluations on several reasoning benchmarks demonstrate that our approach significantly enhances the reasoning performance of LLMs. The code is available at https://github.com/zhmzm/Multi_Temperature_Verl.git.
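As a rough illustration of the token-level idea in the abstract, the sketch below picks a sampling temperature per position from the entropy of the next-token distribution: high-entropy positions are treated as reasoning tokens and sampled at a higher temperature, and low-entropy positions as knowledge tokens and sampled at a lower one. This is not the authors' implementation. The function names, the entropy threshold, and the two temperature values are all hypothetical choices made for this example, and the paper's actual classification rule and schedules may differ.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def pick_temperature(logits, entropy_threshold=1.0, t_high=1.2, t_low=0.7):
    """Assumed rule: entropy at T=1 above the threshold marks a
    'reasoning' position (use t_high); otherwise a 'knowledge'
    position (use t_low). All three defaults are illustrative."""
    h = entropy(softmax(logits, 1.0))
    return t_high if h > entropy_threshold else t_low


def sample_token(logits, rng, **kwargs):
    """Sample one token id using the per-position temperature."""
    temperature = pick_temperature(logits, **kwargs)
    probs = softmax(logits, temperature)
    token = rng.choices(range(len(logits)), weights=probs, k=1)[0]
    return token, temperature


if __name__ == "__main__":
    rng = random.Random(0)
    flat_logits = [1.0, 0.9, 1.1, 1.0]    # many plausible next tokens
    peaked_logits = [8.0, 0.0, 0.0, 0.0]  # one dominant next token
    print(sample_token(flat_logits, rng))
    print(sample_token(peaked_logits, rng))
```

In an RLVR rollout, a rule like this would sit inside the sampling loop of the generation engine. Here the entropy is measured at temperature 1 so that the classification does not depend on the temperature it selects.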

