AI | CL | CY | LG | Aug 10, 2025

Democratizing Diplomacy: A Harness for Evaluating Any Large Language Model on Full-Press Diplomacy

arXiv:2508.07485v1 | 3 citations | h-index: 2 | Has Code
Originality: Incremental advance
AI Analysis

The harness democratizes the study of strategic reasoning in LLMs by eliminating the need for fine-tuning, though the advance is incremental because it builds on existing LLM capabilities.

The authors tackle the problem of evaluating strategic reasoning in large language models (LLMs) on the complex game Diplomacy. They build the first evaluation harness that lets any out-of-the-box LLM play without fine-tuning. With it, a 24B model reliably completes matches, and the authors analyze behavior across a range of models and playstyles.

We present the first evaluation harness that enables any out-of-the-box, local Large Language Model (LLM) to play full-press Diplomacy without fine-tuning or specialized training. Previous work required frontier LLMs or fine-tuning because of the high complexity and information density of Diplomacy's game state. Combined with the high variance of matches, these factors made Diplomacy prohibitive to study. In this work, we used data-driven iteration to optimize a textual game state representation such that a 24B model can reliably complete matches without any fine-tuning. We develop tooling to facilitate hypothesis testing and statistical analysis, and we present case studies on persuasion, aggressive playstyles, and performance across a range of models. We conduct a variety of experiments across many popular LLMs, finding that the larger models perform best but that the smaller models still play adequately. We also introduce Critical State Analysis: an experimental protocol for rapidly iterating on and analyzing key moments in a game in depth. Our harness democratizes the evaluation of strategic reasoning in LLMs by eliminating the need for fine-tuning, and it provides insights into how these capabilities emerge naturally from widely used LLMs. Our code is available in the supplement and will be open sourced.
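
To make the idea of a textual game state representation concrete, here is a minimal sketch of rendering a board position as compact text for a prompt. It is not the authors' format: the function name `render_state`, the line layout, and the input dictionaries are all hypothetical, chosen only to illustrate the kind of serialization the abstract describes.

```python
from typing import Dict, List


def render_state(
    phase: str,
    units: Dict[str, List[str]],
    centers: Dict[str, List[str]],
) -> str:
    """Render a toy Diplomacy position as compact, prompt-ready text.

    Hypothetical layout (an assumption, not the paper's representation):
    one header line for the phase, then one line per power listing its
    units and the supply centers it owns.
    """
    lines = [f"PHASE: {phase}"]
    # Sort powers so the same position always yields the same text.
    for power in sorted(units):
        owned = centers.get(power, [])
        lines.append(
            f"{power}: units=[{', '.join(units[power])}] "
            f"centers({len(owned)})=[{', '.join(owned)}]"
        )
    return "\n".join(lines)


# Opening position for two of the seven powers (Spring 1901, movement phase).
example = render_state(
    "S1901M",
    {
        "FRANCE": ["A PAR", "A MAR", "F BRE"],
        "ENGLAND": ["F LON", "F EDI", "A LVP"],
    },
    {
        "FRANCE": ["PAR", "MAR", "BRE"],
        "ENGLAND": ["LON", "EDI", "LVP"],
    },
)
print(example)
```

Keeping the output deterministic and one line per power is a design choice for this sketch: it keeps the prompt short and makes two renderings of the same position directly comparable, which is convenient when iterating on the representation.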

