AI | CL | May 15, 2025

Reasoning Capabilities of Large Language Models on Dynamic Tasks

arXiv:2505.10543v2 | 3 citations | h-index: 28 | Has Code | 2025 3rd International Conference on Foundation and Large Language Models (FLLM)
Originality: Synthesis-oriented
AI Analysis

This work assesses the reasoning capabilities of large language models in dynamic environments, offering AI researchers incremental insights into model limitations.

The study evaluated large language models on dynamic tasks under several prompting strategies. Strategic prompting can narrow the performance gap between model sizes, but advanced methods yield variable outcomes, limitations in planning and spatial coordination persist, and there is little evidence of emergent reasoning relative to human performance.

Large language models excel on static benchmarks, but their ability as self-learning agents in dynamic environments remains unclear. We evaluate three prompting strategies (self-reflection, heuristic mutation, and planning) across dynamic tasks with open-source models. First, we find that larger models generally outperform smaller ones, but that strategic prompting can close this performance gap. Second, an overly long prompt can negatively impact smaller models on basic reactive tasks, while larger models show more robust behaviour. Third, advanced prompting techniques primarily benefit smaller models on complex games, but offer less improvement for already high-performing large language models. Yet we find that advanced reasoning methods yield highly variable outcomes: while they can significantly improve performance when reasoning and decision-making align, they also introduce instability and can lead to large performance drops. Compared to human performance, our findings reveal little evidence of true emergent reasoning. Instead, large language model performance exhibits persistent limitations in areas such as planning and spatial coordination, suggesting that large language models still suffer from fundamental shortcomings that may not be fully overcome through self-reflective prompting alone. Reasoning is a multi-faceted task, and while methods like chain-of-thought improve multi-step reasoning on math word problems, our findings using dynamic benchmarks highlight important shortcomings in general reasoning capabilities, indicating a need to move beyond static benchmarks to capture the complexity of reasoning.
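To make the idea of self-reflection prompting concrete, here is a minimal sketch of such a loop on a toy task. It is not the paper's implementation: the 1-D environment, the `stub_policy` stand-in for a model call, and the `reflect` helper are all hypothetical names invented for illustration. The only point shown is the general pattern, in which notes written after a failed episode are fed back into the prompt for the next one.

```python
from typing import Callable, List, Tuple


def run_episode(policy: Callable[[str], str], reflections: List[str],
                start: int = 0, target: int = 3,
                max_steps: int = 5) -> Tuple[bool, List[str]]:
    """Play one episode on a toy 1-D line; return (success, action trace)."""
    pos = start
    trace: List[str] = []
    for _ in range(max_steps):
        # The prompt carries the current state plus all earlier reflections.
        prompt = f"position={pos} target={target}\nnotes: " + " | ".join(reflections)
        action = policy(prompt)  # expected to be "left" or "right"
        trace.append(action)
        pos += 1 if action == "right" else -1
        if pos == target:
            return True, trace
    return False, trace


def reflect(success: bool, trace: List[str]) -> str:
    """Hypothetical reflection step; a real system would ask the model itself."""
    if success:
        return "last plan worked"
    other = "right" if trace[-1] == "left" else "left"
    return f"{trace[-1]} did not reach the target; try {other}"


def stub_policy(prompt: str) -> str:
    """Stand-in for an LLM call: moves right only if a note suggests it."""
    return "right" if "try right" in prompt else "left"


def self_reflection_loop(policy: Callable[[str], str],
                         episodes: int = 3) -> List[bool]:
    """Run several episodes, appending one reflection after each."""
    reflections: List[str] = []
    outcomes: List[bool] = []
    for _ in range(episodes):
        success, trace = run_episode(policy, reflections)
        outcomes.append(success)
        reflections.append(reflect(success, trace))
    return outcomes


if __name__ == "__main__":
    print(self_reflection_loop(stub_policy))  # [False, True, True]
```

With the stub policy, the first episode fails, the resulting note steers the second episode to success, and the third repeats it. A real model offers no such guarantee, which is consistent with the variable outcomes the abstract reports.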

Code Implementations: 1 repo