CL, Jan 3, 2025

MIRAGE: Exploring How Large Language Models Perform in Complex Social Interactive Environments

arXiv:2501.01652v2, 4 citations, h-index: 10, Has Code, ACL
Originality: Incremental advance
AI Analysis

This work addresses the need for better assessment tools for AI researchers and developers to understand LLMs' limitations in social interactions, though it is incremental as it focuses on a specific evaluation framework.

The paper tackles the problem of evaluating large language models' ability to simulate complex human behaviors in interactive role-playing environments, specifically using murder mystery games, and finds that even advanced models like GPT-4 struggle significantly with the challenges presented.

Large Language Models (LLMs) have shown remarkable capabilities in environmental perception, reasoning-based decision-making, and simulating complex human behaviors, particularly in interactive role-playing contexts. This paper introduces the Multiverse Interactive Role-play Ability General Evaluation (MIRAGE), a comprehensive framework designed to assess LLMs' proficiency in portraying advanced human behaviors through murder mystery games. MIRAGE features eight intricately crafted scripts encompassing diverse themes and styles, providing a rich simulation. To evaluate LLMs' performance, MIRAGE employs four distinct methods: the Trust Inclination Index (TII) to measure the dynamics of trust and suspicion, the Clue Investigation Capability (CIC) to measure LLMs' capability of investigating clues and gathering information, the Interactivity Capability Index (ICI) to assess role-playing capabilities, and the Script Compliance Index (SCI) to assess LLMs' capability of understanding and following instructions. Our experiments indicate that even popular models like GPT-4 face significant challenges in navigating the complexities presented by MIRAGE. The datasets and simulation code are available on GitHub at https://github.com/lime728/MIRAGE.

Code Implementations: 1 repo
