CL · AI · Jun 12, 2024

Language Model Council: Democratically Benchmarking Foundation Models on Highly Subjective Tasks

arXiv:2406.08598v4 · 15 citations
Originality: Incremental advance
AI Analysis

The paper addresses intra-model bias in LLM evaluation for subjective tasks such as emotional intelligence. It offers a more inclusive and reliable benchmarking method, though the advance over existing panel-based approaches is incremental.

The paper tackles the challenge of evaluating large language models (LLMs) on subjective tasks by introducing the Language Model Council (LMC), where multiple LLMs collaborate democratically to create tests, respond, and evaluate each other, resulting in rankings that are more separable, robust, and consistent with human evaluations than single-model judges.

As Large Language Models (LLMs) continue to evolve, evaluating them remains a persistent challenge. Many recent evaluations use LLMs as judges to score outputs from other LLMs, often relying on a single large model like GPT-4o. However, using a single LLM judge is prone to intra-model bias, and many tasks - such as those related to emotional intelligence, creative writing, and persuasiveness - may be too subjective for a single model to judge fairly. We introduce the Language Model Council (LMC), where a group of LLMs collaborate to create tests, respond to them, and evaluate each other's responses to produce a ranking in a democratic fashion. Unlike previous approaches that focus on reducing cost or bias by using a panel of smaller models, our work examines the benefits and nuances of a fully inclusive LLM evaluation system. In a detailed case study on emotional intelligence, we deploy a council of 20 recent LLMs to rank each other on open-ended responses to interpersonal conflicts. Our results show that the LMC produces rankings that are more separable and more robust, and through a user study, we show that they are more consistent with human evaluations than any individual LLM judge. Using all LLMs for judging can be costly, however, so we use Monte Carlo simulations and hand-curated sub-councils to study hypothetical council compositions and discuss the value of the incremental LLM judge.
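The abstract does not say how council votes are aggregated or how the Monte Carlo study is set up, so the following is only a minimal sketch under stated assumptions. It assumes each judge supplies pairwise preferences, ranks models by the number of pairwise majority wins, and samples random sub-councils to see how often they reproduce the full council's top choice. The function names (`council_ranking`, `subcouncil_agreement`), the judge and model labels, and the toy votes are all hypothetical, not taken from the paper or its code.

```python
import random
from collections import Counter


def council_ranking(votes, judges):
    """Rank models by pairwise majority wins (an assumed aggregation rule).

    votes maps judge name -> list of (winner, loser) preferences.
    Only the judges listed in `judges` are counted.
    """
    tally = Counter()  # (a, b) -> number of judges preferring a over b
    for judge in judges:
        for winner, loser in votes[judge]:
            tally[(winner, loser)] += 1

    models = set()
    for a, b in list(tally):
        models.update((a, b))

    wins = Counter()  # model -> number of opponents beaten by majority
    for a in models:
        for b in models:
            if a != b and tally[(a, b)] > tally[(b, a)]:
                wins[a] += 1

    # Most wins first; ties broken alphabetically so the result is deterministic.
    return sorted(models, key=lambda m: (-wins[m], m))


def subcouncil_agreement(votes, size, trials=200, seed=0):
    """Monte Carlo estimate of how often a random sub-council of `size`
    judges picks the same top model as the full council."""
    rng = random.Random(seed)
    all_judges = sorted(votes)
    full_top = council_ranking(votes, all_judges)[0]
    hits = 0
    for _ in range(trials):
        sample = rng.sample(all_judges, size)
        if council_ranking(votes, sample)[0] == full_top:
            hits += 1
    return hits / trials


# Hypothetical toy data: three judges, three models.
toy_votes = {
    "judge_a": [("m1", "m2"), ("m1", "m3"), ("m2", "m3")],
    "judge_b": [("m1", "m2"), ("m3", "m1"), ("m2", "m3")],
    "judge_c": [("m2", "m1"), ("m1", "m3"), ("m2", "m3")],
}

full_ranking = council_ranking(toy_votes, sorted(toy_votes))
single_judge_agreement = subcouncil_agreement(toy_votes, size=1)
```

In this toy setup the individual judges disagree (one prefers m2 overall and another produces a preference cycle), while the full council still yields a clear ordering. Comparing `subcouncil_agreement` across different `size` values is one simple way to probe the value of each additional judge.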

Code Implementations: 1 repo
