CL · AI · Feb 7, 2024

The Effect of Sampling Temperature on Problem Solving in Large Language Models

arXiv:2402.05201v3 · 286 citations · h-index: 4 · Has Code · EMNLP
AI Analysis

This work addresses a practical issue for AI researchers and practitioners by empirically testing a common assumption about LLM tuning, though it is incremental as it focuses on clarifying existing knowledge.

The study investigated how sampling temperature affects Large Language Models' performance on problem-solving tasks, finding that temperature changes from 0.0 to 1.0 do not have a statistically significant impact on accuracy across various models and techniques.

In this research study, we empirically investigate the effect of sampling temperature on the performance of Large Language Models (LLMs) on various problem-solving tasks. We created a multiple-choice question-and-answer (MCQA) exam by randomly sampling problems from standard LLM benchmarks. Then, we used nine popular LLMs with five prompt-engineering techniques to solve the MCQA problems while increasing the sampling temperature from 0.0 to 1.6. Despite anecdotal reports to the contrary, our empirical results indicate that changes in temperature from 0.0 to 1.0 do not have a statistically significant impact on LLM performance for problem-solving tasks. In addition, these results appear to generalize across LLMs, prompt-engineering techniques, and problem domains. All code, data, and supplemental materials are available on GitHub at: https://github.com/matthewrenze/jhu-llm-temperature
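
For readers unfamiliar with the parameter under study, the following is a minimal sketch of how sampling temperature reshapes a next-token distribution. It is not taken from the authors' repository. The function name `softmax_with_temperature` and the four example logits are hypothetical, and treating a temperature of 0.0 as greedy (argmax) decoding is an assumed convention rather than something the abstract specifies.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into sampling probabilities at a given temperature."""
    if temperature == 0.0:
        # Assumed convention: temperature 0 means greedy decoding,
        # so all probability mass goes to the highest-scoring option.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    # Divide each score by the temperature before the softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for four answer options (A-D) of one MCQA problem.
logits = [2.0, 1.0, 0.5, 0.0]

# A few points from the 0.0 to 1.6 range mentioned in the abstract.
for t in (0.0, 0.5, 1.0, 1.6):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the top-scoring option, while higher temperatures flatten the distribution, which makes sampled answers more varied. The study asks whether that added variability changes accuracy on problem-solving tasks.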

Code Implementations: 1 repo