CLSep 10, 2024

PingPong: A Benchmark for Role-Playing Language Models with User Emulation and Multi-Model Evaluation

arXiv:2409.06820v46 citationsh-index: 7
Originality Synthesis-oriented
AI Analysis

This provides a foundation for robust evaluation of language models in interactive scenarios, though it is incremental as it builds on existing benchmarking approaches.

The authors tackled the problem of evaluating role-playing capabilities in language models by creating a benchmark called PingPong, which uses simulated users and multi-turn conversations to assess over 40 models in English and Russian, showing strong correlation between automated and human evaluations.

We introduce a benchmark for evaluating the role-playing capabilities of language models. Our approach leverages different language models to simulate users in dynamic, multi-turn conversations and assess the resulting dialogues. Our methodology involves three main components: a player model that adopts a specific character role, an interrogator model that simulates user behavior in a specific situation, and a judge model ensemble that evaluates conversation quality with 3 metrics: character consistency, entertainment value, and language fluency. We evaluated more than 40 models in both English and Russian, with each model participating in 64 conversations with 8 characters and 8 situations. We conducted experiments comparing automated evaluations with human annotations to validate our approach, demonstrating strong correlations across multiple criteria. This work provides a foundation for a robust and dynamic evaluation of different model capabilities in interactive scenarios.

Code Implementations1 repo
Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.

Your Notes