CL, AI, LG. May 20, 2025

KORGym: A Dynamic Game Platform for LLM Reasoning Evaluation

arXiv:2505.14552v2. 7 citations. h-index: 18
Originality: Synthesis-oriented
AI Analysis

This paper provides a new benchmark for evaluating reasoning in LLMs. The contribution is incremental, since it builds on existing platforms such as KOR-Bench and Gymnasium.

The authors tackled the need for better evaluation of LLM reasoning by introducing KORGym, a dynamic platform with over 50 games, and found that closed-source models performed best in tests on 19 LLMs and 8 VLMs.

Recent advancements in large language models (LLMs) underscore the need for more comprehensive evaluation methods to accurately assess their reasoning capabilities. Existing benchmarks are often domain-specific and thus cannot fully capture an LLM's general reasoning potential. To address this limitation, we introduce the Knowledge Orthogonal Reasoning Gymnasium (KORGym), a dynamic evaluation platform inspired by KOR-Bench and Gymnasium. KORGym offers over fifty games in either textual or visual formats and supports interactive, multi-turn assessments with reinforcement learning scenarios. Using KORGym, we conduct extensive experiments on 19 LLMs and 8 VLMs, revealing consistent reasoning patterns within model families and demonstrating the superior performance of closed-source models. Further analysis examines the effects of modality, reasoning strategies, reinforcement learning techniques, and response length on model performance. We expect KORGym to become a valuable resource for advancing LLM reasoning research and developing evaluation methodologies suited to complex, interactive environments.

Code Implementations: 1 repo