LG AI CL | Sep 25, 2025

StyleBench: Evaluating thinking styles in Large Language Models

Junyu Guo, Shangding Gu, Ming Jin, Costas Spanos, Javad Lavaei

arXiv:2509.20868v111.46 citations | h-index: 56 | Has Code

Originality: Incremental advance

AI Analysis

This work offers a roadmap for selecting reasoning strategies in LLMs, helping practitioners match a strategy to their model scale, task type, and efficiency constraints.

The paper studies how reasoning styles affect Large Language Models (LLMs) by introducing StyleBench, a benchmark that evaluates five styles across diverse tasks and models. It finds that no single style is universally optimal: efficacy depends on model scale and task type, with search-based methods excelling on open-ended problems and concise styles delivering efficiency gains on well-defined tasks.

The effectiveness of Large Language Models (LLMs) is heavily influenced by the reasoning strategies, or styles of thought, employed in their prompts. However, the interplay between these reasoning styles, model architecture, and task type remains poorly understood. To address this, we introduce StyleBench, a comprehensive benchmark for systematically evaluating reasoning styles across diverse tasks and models. We assess five representative reasoning styles, Chain of Thought (CoT), Tree of Thought (ToT), Algorithm of Thought (AoT), Sketch of Thought (SoT), and Chain-of-Draft (CoD), on five reasoning tasks, using 15 open-source models from major families (LLaMA, Qwen, Mistral, Gemma, GPT-OSS, Phi, and DeepSeek) ranging from 270M to 120B parameters. Our large-scale analysis reveals that no single style is universally optimal. We demonstrate that strategy efficacy is highly contingent on both model scale and task type: search-based methods (AoT, ToT) excel in open-ended problems but require large-scale models, while concise styles (SoT, CoD) achieve radical efficiency gains on well-defined tasks. Furthermore, we identify key behavioral patterns: smaller models frequently fail to follow output instructions and default to guessing, while reasoning robustness emerges as a function of scale. Our findings offer a crucial roadmap for selecting optimal reasoning strategies based on specific constraints. We open source the benchmark at https://github.com/JamesJunyuGuo/Style_Bench.
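
To make the scale of the evaluation concrete, the sketch below enumerates a style × task × model grid of the size the abstract describes and aggregates accuracy per style. It is an illustrative sketch only, not the StyleBench code: the style abbreviations come from the abstract, while the task and model identifiers are placeholders (the abstract gives only their counts), and `evaluation_grid` and `accuracy_by_style` are hypothetical helper names.

```python
from itertools import product

# Style abbreviations as listed in the abstract.
STYLES = ["CoT", "ToT", "AoT", "SoT", "CoD"]
# Placeholders (assumption): the abstract states 5 tasks and 15 models
# but does not name them, so generic identifiers are used here.
TASKS = [f"task_{i}" for i in range(1, 6)]
MODELS = [f"model_{i}" for i in range(1, 16)]


def evaluation_grid(styles=STYLES, tasks=TASKS, models=MODELS):
    """Return every (style, task, model) combination to evaluate."""
    return list(product(styles, tasks, models))


def accuracy_by_style(records):
    """Aggregate accuracy per style from (style, is_correct) records."""
    totals, correct = {}, {}
    for style, is_correct in records:
        totals[style] = totals.get(style, 0) + 1
        correct[style] = correct.get(style, 0) + int(is_correct)
    return {style: correct[style] / totals[style] for style in totals}


if __name__ == "__main__":
    grid = evaluation_grid()
    print(len(grid))  # 5 styles * 5 tasks * 15 models = 375 configurations
```

Grouping results by style, and then further by task or model size, is one simple way to check a claim such as "no single style is universally optimal": if the best style changes from one group to another, no global winner exists.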
