Temperature-Dependent Performance of Prompting Strategies in Extended Reasoning Large Language Models

arXiv:2604.08563

AI Analysis

This work addresses the optimization of temperature and prompting for researchers and practitioners using extended reasoning LLMs, offering incremental insights into performance tuning.

The study investigates how sampling temperature interacts with prompting strategy in extended reasoning large language models. Zero-shot prompting peaks at moderate temperatures (59% accuracy at T=0.4 and T=0.7), chain-of-thought prompting performs best at the temperature extremes, and the benefit of extended reasoning grows from 6x at T=0.0 to 14.3x at T=1.0.

Extended reasoning models represent a transformative shift in Large Language Model (LLM) capabilities by enabling explicit test-time computation for complex problem solving. However, the optimal configuration of sampling temperature and prompting strategy for these systems remains largely underexplored. We systematically evaluate chain-of-thought and zero-shot prompting across four temperature settings (0.0, 0.4, 0.7, and 1.0) using Grok-4.1 with extended reasoning on 39 mathematical problems from AMO-Bench, a challenging International Mathematical Olympiad-level benchmark. We find that zero-shot prompting achieves peak performance at moderate temperatures, reaching 59% accuracy at T=0.4 and T=0.7, while chain-of-thought prompting performs best at the temperature extremes. Most notably, the benefit of extended reasoning increases from 6x at T=0.0 to 14.3x at T=1.0. These results suggest that temperature should be optimized jointly with prompting strategy, challenging the common practice of using T=0 for reasoning tasks.
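The evaluation design described in the abstract (two prompting strategies crossed with four temperatures, scored on a fixed problem set) can be sketched roughly as follows. This is an illustrative sketch, not the authors' harness: the `solve` callable, the exact-match scoring rule, and the `gain_ratio` helper are all hypothetical, and the reading of the "6x" and "14.3x" figures as a ratio of accuracies is an assumption.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple

TEMPERATURES = [0.0, 0.4, 0.7, 1.0]             # the four settings named in the abstract
STRATEGIES = ["zero_shot", "chain_of_thought"]  # the two prompting strategies compared

# Hypothetical signature: (question, strategy, temperature) -> model's final answer.
Solver = Callable[[str, str, float], str]


def accuracy_grid(
    problems: List[Tuple[str, str]], solve: Solver
) -> Dict[Tuple[str, float], float]:
    """Return the fraction of correct answers for every (strategy, temperature) cell."""
    grid: Dict[Tuple[str, float], float] = {}
    for strategy, temperature in product(STRATEGIES, TEMPERATURES):
        # Assumed scoring rule: exact string match against the reference answer.
        correct = sum(
            solve(question, strategy, temperature).strip() == reference.strip()
            for question, reference in problems
        )
        grid[(strategy, temperature)] = correct / len(problems)
    return grid


def gain_ratio(acc_with: float, acc_without: float) -> float:
    """Multiplicative gain between two accuracies (assumed meaning of an 'Nx' benefit)."""
    if acc_without == 0:
        raise ValueError("baseline accuracy is zero; the ratio is undefined")
    return acc_with / acc_without
```

As a sanity check on the reported numbers: if each of the 39 problems is scored once per configuration (an assumption), 59% accuracy corresponds to 23 correct answers, since 23/39 ≈ 0.590 while 22/39 ≈ 0.564 and 24/39 ≈ 0.615.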
