SE AI · Oct 28, 2024

Project MPG: towards a generalized performance benchmark for LLM capabilities

Lucas Spangher, Tianle Li, William F. Arnold, Nick Masiewicki, Xerxes Dotiwalla, Rama Parusmathi, Peter Grabowski, Eugene Ie, Dan Gruhl

arXiv:2410.22368v13.33 citations · h-index: 6

Originality: Synthesis-oriented

AI Analysis

This provides a practical tool for non-experts to compare LLMs, though it is incremental as it builds on existing benchmarking methods.

The authors address the problem of aggregating diverse LLM benchmarks into a single actionable metric for non-experts. They propose Project MPG, which reports two numbers, 'Goodness' (answer accuracy) and 'Fastness' (cost or QPS), and they report that their scores correlate with Chatbot Arena more closely than the MMLU leaderboard does.

There exists an extremely wide array of LLM benchmarking tasks, yet a single number is often the most actionable for decision-making, especially by non-experts. No aggregation schema currently exists that is not Elo-based, and Elo-based approaches can be costly or time-consuming. Here we propose a method to aggregate performance across a general space of benchmarks, nicknamed Project "MPG" (Model Performance and Goodness), a name that also references a metric widely understood to be an important yet inaccurate and crude measure of car performance. We create two numbers: a "Goodness" number (answer accuracy) and a "Fastness" number (cost or QPS). We compare models against each other and present a ranking according to our general metric as well as by subdomain. We find that our scores agree significantly with those of Chatbot Arena under raw Pearson correlation, even improving on the correlation of the MMLU leaderboard with Chatbot Arena.
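To make the evaluation idea concrete, the following is a minimal sketch of collapsing per-benchmark accuracies into one number per model and checking its raw Pearson correlation against a reference leaderboard. It is not the authors' aggregation method, which the abstract does not specify: the unweighted mean used here is an assumed stand-in, and the model names, benchmark names, and all scores are hypothetical illustration data.

```python
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Raw Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def goodness(benchmark_accuracies):
    """Assumed aggregation: unweighted mean accuracy across benchmarks."""
    return mean(benchmark_accuracies.values())


# Hypothetical per-benchmark accuracies (not results from the paper).
accuracies = {
    "model_a": {"bench_1": 0.80, "bench_2": 0.70, "bench_3": 0.90},
    "model_b": {"bench_1": 0.60, "bench_2": 0.65, "bench_3": 0.55},
    "model_c": {"bench_1": 0.40, "bench_2": 0.50, "bench_3": 0.30},
}

# Hypothetical reference leaderboard scores (e.g. Elo-style ratings).
reference_scores = {"model_a": 1250.0, "model_b": 1180.0, "model_c": 1100.0}

models = sorted(accuracies)
goodness_scores = {m: goodness(accuracies[m]) for m in models}

# Rank models by the aggregated score, best first.
ranking = sorted(models, key=goodness_scores.get, reverse=True)

# Agreement between the aggregated scores and the reference leaderboard.
r = pearson(
    [goodness_scores[m] for m in models],
    [reference_scores[m] for m in models],
)
```

A "Fastness" number could be handled the same way by substituting cost or QPS measurements for the accuracies, again as an assumption about how such a score might be computed.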
