MA LG Oct 31, 2024

Soft Condorcet Optimization for Ranking of General Agents

Marc Lanctot, Kate Larson, Michael Kaisers, Quentin Berthet, Ian Gemp, Manfred Diaz, Roberto-Rafael Maura-Rivero, Yoram Bachrach, Anna Koop, Doina Precup

arXiv:2411.00119v45.96 citations h-index: 41 AAMAS

Originality: Incremental advance

AI Analysis

The paper provides a method for aggregating performance across tasks to rank AI agents. The advance is incremental: it builds on existing voting theory and applies it to a new context, AI evaluation.

The paper tackles the problem of ranking general AI agents across diverse tasks. It proposes Soft Condorcet Optimization (SCO), a ranking scheme based on social choice theory that seeks the ranking making the fewest mistakes in predicting the agent comparisons in the evaluation data. As an approximation to the Kemeny-Young voting method, SCO rankings are on average 0 to 0.043 away from the optimal ranking in normalized Kendall-tau distance. In simulated noisy tournaments, SCO is the best of several baselines when 59% or more of the preference data is missing.

Driving progress of AI models and agents requires comparing their performance on standardized benchmarks; for general agents, individual performances must be aggregated across a potentially wide variety of different tasks. In this paper, we describe a novel ranking scheme inspired by social choice frameworks, called Soft Condorcet Optimization (SCO), to compute the optimal ranking of agents: the one that makes the fewest mistakes in predicting the agent comparisons in the evaluation data. This optimal ranking is the maximum likelihood estimate when evaluation data (which we view as votes) are interpreted as noisy samples from a ground truth ranking, a solution to Condorcet's original voting system criteria. SCO ratings are maximal for Condorcet winners when they exist, which we show is not necessarily true for the classical rating system Elo. We propose three optimization algorithms to compute SCO ratings and evaluate their empirical performance. When serving as an approximation to the Kemeny-Young voting method, SCO rankings are on average 0 to 0.043 away from the optimal ranking in normalized Kendall-tau distance across 865 preference profiles from the PrefLib open ranking archive. In a simulated noisy tournament setting, SCO achieves accurate approximations to the ground truth ranking and the best among several baselines when 59% or more of the preference data is missing. Finally, SCO ranking provides the best approximation to the optimal ranking, measured on held-out test sets, in a problem containing 52,958 human players across 31,049 games of the classic seven-player game of Diplomacy.
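
The quantities named in the abstract can be made concrete on a toy example. The sketch below is illustrative only and is not the authors' implementation. It counts the "mistakes" a ranking makes against a set of votes, brute-forces the ranking with the fewest mistakes (the Kemeny-Young objective), computes the normalized Kendall-tau distance, and checks for a Condorcet winner. The final function, `soft_ratings`, is a hypothetical smooth surrogate: it assumes each 0/1 mistake is replaced by a sigmoid of the rating difference and minimized by plain gradient descent. The abstract does not state the exact loss or the three optimization algorithms, so the function name, `temperature`, `lr`, and `steps` are all assumptions made for illustration.

```python
import math
from itertools import combinations, permutations


def pairwise_mistakes(ranking, votes):
    """Count vote comparisons (a ranked above b) that `ranking` contradicts."""
    pos = {agent: i for i, agent in enumerate(ranking)}
    return sum(
        1
        for vote in votes
        for a, b in combinations(vote, 2)  # a appears before b in the vote
        if pos[a] > pos[b]
    )


def kemeny_ranking(agents, votes):
    """Brute-force the ranking with the fewest mistakes (tiny inputs only)."""
    return min(permutations(agents), key=lambda r: pairwise_mistakes(r, votes))


def normalized_kendall_tau(r1, r2):
    """Fraction of agent pairs that the two rankings order differently."""
    pos1 = {a: i for i, a in enumerate(r1)}
    pos2 = {a: i for i, a in enumerate(r2)}
    pairs = list(combinations(r1, 2))
    discordant = sum(
        1 for a, b in pairs if (pos1[a] - pos1[b]) * (pos2[a] - pos2[b]) < 0
    )
    return discordant / len(pairs)


def condorcet_winner(agents, votes):
    """Return the agent beating every other in a strict majority of votes, else None."""
    for a in agents:
        if all(
            sum(v.index(a) < v.index(b) for v in votes) > len(votes) / 2
            for b in agents
            if b != a
        ):
            return a
    return None


def soft_ratings(agents, votes, temperature=1.0, lr=0.1, steps=2000):
    """Hypothetical surrogate: gradient descent on a sigmoid-smoothed mistake count.

    Assumed loss: sum over votes and pairs (a above b) of
    sigmoid((theta[b] - theta[a]) / temperature).
    """
    theta = {a: 0.0 for a in agents}
    for _ in range(steps):
        grad = {a: 0.0 for a in agents}
        for vote in votes:
            for a, b in combinations(vote, 2):  # the vote says a beats b
                s = 1.0 / (1.0 + math.exp(-(theta[b] - theta[a]) / temperature))
                g = s * (1.0 - s) / temperature  # derivative of the term w.r.t. theta[b]
                grad[b] += g
                grad[a] -= g
        for a in agents:
            theta[a] -= lr * grad[a]
    return theta


# Toy data: three full rankings over three agents (made up for illustration).
agents = ("a", "b", "c")
votes = [("a", "b", "c"), ("a", "c", "b"), ("b", "a", "c")]

best = kemeny_ranking(agents, votes)
theta = soft_ratings(agents, votes)
soft_order = tuple(sorted(theta, key=theta.get, reverse=True))

print("fewest-mistakes ranking:", best)
print("Condorcet winner:", condorcet_winner(agents, votes))
print("ranking from soft ratings:", soft_order)
print("Kendall-tau distance between them:", normalized_kendall_tau(best, soft_order))
```

The brute-force search enumerates every permutation, so it is only practical for a handful of agents. That cost is the usual motivation for a differentiable surrogate such as the assumed sigmoid loss, which scales with the number of observed comparisons instead.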
