AIDG: Evaluating Asymmetry Between Information Extraction and Containment in Multi-Turn Dialogue
This work addresses the need for dynamic benchmarks that assess the strategic reasoning capabilities of LLMs, though its contribution is incremental because it focuses on specific dialogue tasks.
The paper addresses the problem of evaluating strategic reasoning in Large Language Models (LLMs) by introducing AIDG, a game-theoretic framework for multi-turn dialogue. It finds a clear asymmetry: models perform substantially better at information containment than at deduction, with a 350 ELO advantage on defense.
Evaluating the strategic reasoning capabilities of Large Language Models (LLMs) requires moving beyond static benchmarks to dynamic, multi-turn interactions. We introduce AIDG (Adversarial Information Deduction Game), a game-theoretic framework that probes the asymmetry between information extraction (active deduction) and information containment (state maintenance) in dialogue. We propose two complementary tasks: AIDG-I, measuring pragmatic strategy in social deduction, and AIDG-II, measuring constraint satisfaction in a structured "20 Questions" setting. Across 439 games with six frontier LLMs, we observe a clear capability asymmetry: models perform substantially better at containment than at deduction, with a 350 ELO advantage on defense (Cohen's d = 5.47). We identify two bottlenecks driving this gap: (1) Information Dynamics, where confirmation strategies are 7.75x more effective than blind deduction (p < 0.00001), and (2) Constraint Adherence, where instruction-following degrades under conversational load, accounting for 41.3% of deductive failures. These findings suggest that while LLMs excel at local defensive coherence, they struggle with the global state tracking required for strategic inquiry.
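To give a feel for the size of the reported gap, the sketch below converts a rating difference into an expected score and computes Cohen's d from two samples. It assumes the conventional logistic Elo model with a 400-point scale and the pooled-standard-deviation form of Cohen's d; the abstract does not say which variants were used, and the function names `elo_expected_score` and `cohens_d` are illustrative rather than taken from the paper.

```python
import math
from statistics import mean, variance


def elo_expected_score(rating_gap: float, scale: float = 400.0) -> float:
    """Expected score of the higher-rated side under the standard logistic
    Elo model (assumed here; the 400-point scale is the usual convention)."""
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / scale))


def cohens_d(a, b) -> float:
    """Cohen's d using the pooled sample standard deviation (assumed variant)."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)


if __name__ == "__main__":
    # A 350-point gap corresponds to an expected score of roughly 0.88.
    print(round(elo_expected_score(350), 3))
    # Toy samples only, not data from the paper.
    print(cohens_d([2, 4, 6], [1, 3, 5]))
```

Under these assumptions, a 350-point advantage means the defending side would be expected to score about 0.88 against the attacking side in a head-to-head pairing.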