Strategic Coercion Within Alliances: The Greenland Sovereignty Game as an AI Stress Test
For researchers and policymakers concerned with AI safety in geopolitical contexts, this paper provides a structural benchmark for LLM behavior in alliance coercion scenarios, revealing model-specific escalation tendencies and cultural biases.
This paper uses the Greenland sovereignty crisis (2019-2026) as a stress test for LLM geopolitical behavior, simulating 3,604 games across eight frontier models. Key findings include increased escalation under coercion framing (from 10.7% to 28.6%), systematic differences between Chinese- and Western-origin models, and peaceful acquisition in only 1.9% of clean games. Only 3 of the 8 models ever achieve it, most prominently DeepSeek V3.2, which does so via a stable five-round playbook.
What happens when the strongest alliance member pressures a weaker member over territory and strategic control? We examine the Greenland sovereignty crisis as a stress test for LLM geopolitics, centered on the 2019-2026 U.S. push to acquire Greenland from the Kingdom of Denmark. The crisis nests two collective-action problems: Arctic strategic control and whether NATO can enforce alliance norms against the dominant member. We develop three games (asymmetric coercion; a NATO assurance game with a critical-mass tipping point; a triadic extensive-form game with social preferences) and test them with a multi-agent simulation in which eight frontier LLMs play six geopolitical roles (United States, Denmark, Greenland, NATO, Russia, Canada) across 3,604 completed games and 108,120 action observations. Using inverse game theory, we recover each model's structural utility parameters (alpha, beta, gamma, delta, eta), which weight material self-interest, reciprocity, inequality aversion, norm respect, and commitment consistency, respectively. Three findings stand out. First, all eight models become more escalatory under coercion framing (four-action escalation rises from 10.7% to 28.6%). Second, Chinese-origin models show systematically different power-weight profiles from Western-origin models when playing the U.S. role. Third, peaceful U.S. acquisition emerges in only 1.9% of clean games, and only 3 of 8 frontier models ever achieve it, most prominently DeepSeek V3.2, which executes a stable five-round playbook through the metropole. Prompts emphasizing jus cogens and self-determination reduce escalation back near baseline in the English-only confirmatory sample; multilingual contrasts are reported as exploratory sensitivity checks. We position this as a structural benchmark for LLM geopolitical behavior, complementing action-frequency benchmarks.
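To make the inverse-game-theory step concrete, the following is a minimal sketch of one common way to recover utility weights from observed actions. It is not the paper's estimator. It assumes a utility that is linear in five per-action features, one per parameter named above, and a logit (softmax) choice rule, and it fits the weights by maximum likelihood. The function names, the feature encoding, and the synthetic data generator are all hypothetical and used only for illustration.

```python
import math
import random

# Assumed feature order, mirroring the five parameters named in the abstract.
PARAMS = ("alpha", "beta", "gamma", "delta", "eta")


def choice_probs(weights, features):
    """Logit choice rule: softmax over linear utilities (an assumed functional form).

    features[a] is the feature vector of candidate action a.
    """
    utils = [sum(w * x for w, x in zip(weights, f)) for f in features]
    top = max(utils)  # subtract the max for numerical stability
    exps = [math.exp(u - top) for u in utils]
    total = sum(exps)
    return [e / total for e in exps]


def fit_weights(observations, n_params=len(PARAMS), lr=1.0, steps=150):
    """Maximum-likelihood fit by plain gradient ascent on the mean log-likelihood.

    observations: list of (features, chosen_index) pairs.
    """
    weights = [0.0] * n_params
    n = len(observations)
    for _ in range(steps):
        grad = [0.0] * n_params
        for features, chosen in observations:
            probs = choice_probs(weights, features)
            for k in range(n_params):
                expected = sum(p * f[k] for p, f in zip(probs, features))
                # Gradient of log-likelihood: observed minus expected feature value.
                grad[k] += features[chosen][k] - expected
        weights = [w + lr * g / n for w, g in zip(weights, grad)]
    return weights


def simulate(true_weights, n_obs=600, n_actions=4, seed=0):
    """Hypothetical synthetic data: random action features, choices drawn from the logit rule."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_obs):
        features = [[rng.uniform(-1.0, 1.0) for _ in true_weights]
                    for _ in range(n_actions)]
        probs = choice_probs(true_weights, features)
        chosen = rng.choices(range(n_actions), weights=probs)[0]
        data.append((features, chosen))
    return data


if __name__ == "__main__":
    true_weights = [1.0, 0.5, 0.3, 0.8, 0.2]  # illustrative values, not results
    estimate = fit_weights(simulate(true_weights))
    for name, value in zip(PARAMS, estimate):
        print(f"{name}: {value:.2f}")
```

On synthetic data generated from known weights, the fitted values should land close to those weights, which is the basic sanity check for any recovery procedure of this kind. A real application would replace the random features with payoff, reciprocity, inequality, norm, and consistency terms computed from each game state.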