AI | CL | Aug 19, 2024

Fine-Grained and Thematic Evaluation of LLMs in Social Deduction Game

arXiv:2408.09946v3 | 3 citations | h-index: 3
Originality: Incremental advance
AI Analysis

This work provides a more detailed evaluation framework for LLMs in obscured communication tasks. The advance is incremental, but it gives researchers in AI and game-based testing a finer-grained way to assess models.

The paper addresses two weaknesses in how large language models (LLMs) are evaluated in social deduction games: coarse-grained metrics and unstructured error analyses. It introduces six fine-grained metrics and identifies four major reasoning failures.

Recent studies have investigated whether large language models (LLMs) can support obscured communication, which is characterized by core aspects such as inferring subtext and evading suspicion. To conduct the investigation, researchers have used social deduction games (SDGs), in which players conceal and infer specific information, as their experimental environment. However, prior work has often overlooked how LLMs should be evaluated in such settings. Specifically, we point out two limitations of the evaluation methods employed. First, the metrics used in prior studies are coarse-grained: they are based on overall game outcomes, which often fail to capture event-level behaviors. Second, error analyses have lacked structured methodologies capable of producing insights that meaningfully support evaluation outcomes. To address these limitations, we propose a microscopic and systematic approach to the investigation. To resolve the first issue, we introduce six fine-grained metrics. To tackle the second issue, we conduct a thematic analysis and identify four major reasoning failures that undermine LLMs' performance in obscured communication.

