CL | Apr 10, 2025

Model Utility Law: Evaluating LLMs beyond Performance through Mechanism Interpretable Metric

arXiv:2504.07440v3 | 7 citations | h-index: 12 | Has Code
Originality: Incremental advance
AI Analysis

This work addresses the generalization problem in LLM evaluation, offering researchers and practitioners a new metric that complements traditional benchmarks. Its contribution is an incremental improvement on existing evaluation methods.

The paper tackles the challenge of evaluating LLMs beyond performance by proposing the Model Utilization Index (MUI), a mechanism interpretability metric that quantifies model effort as the proportion of neurons activated on a task. Experiments reveal an inverse logarithmic relationship between MUI and performance, from which the authors derive practical corollaries for training diagnostics and model comparisons.

Large Language Models (LLMs) have become indispensable across academia, industry, and daily applications, yet current evaluation methods struggle to keep pace with their rapid development. One core challenge of evaluation in the LLM era is the generalization issue: how to infer a model's near-unbounded abilities from inevitably bounded benchmarks. We address this challenge by proposing the Model Utilization Index (MUI), a mechanism interpretability enhanced metric that complements traditional performance scores. MUI quantifies the effort a model expends on a task, defined as the proportion of neurons or features activated during inference. Intuitively, a truly capable model should achieve higher performance with lower effort. Extensive experiments across popular LLMs reveal a consistent inverse logarithmic relationship between MUI and performance, which we formulate as the Utility Law. From this law we derive four practical corollaries that (i) guide training diagnostics, (ii) expose data contamination, (iii) enable fairer model comparisons, and (iv) guide the design of model-specific dataset diversity. Our code can be found at https://github.com/ALEX-nlp/MUI-Eva.
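
To make the definition above concrete, here is a minimal sketch of how a utilization score of this kind could be computed and related to performance. It is not taken from the paper or its repository: the function names, the activation threshold, the choice to count a neuron as used if it fires on at least one sample, and the linear-in-log form `score = a + b * ln(MUI)` are all assumptions made for illustration.

```python
import math


def model_utilization_index(activations, threshold=0.0):
    """Fraction of neurons that exceed `threshold` on at least one sample.

    `activations` is a list of per-sample lists of neuron activation values.
    Assumption: a neuron counts as "activated" for the task if any sample
    pushes it above the threshold; the paper may define this differently.
    """
    if not activations:
        raise ValueError("need at least one sample")
    n_neurons = len(activations[0])
    used = set()
    for sample in activations:
        if len(sample) != n_neurons:
            raise ValueError("all samples must cover the same neurons")
        for idx, value in enumerate(sample):
            if value > threshold:
                used.add(idx)
    return len(used) / n_neurons


def fit_log_relation(mui_values, scores):
    """Ordinary least-squares fit of score = a + b * ln(MUI).

    Assumption: this is one simple reading of an "inverse logarithmic"
    relationship, under which the fitted slope b would be negative.
    """
    xs = [math.log(m) for m in mui_values]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(scores) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


# Toy data (made up): two samples over four neurons.
toy_activations = [
    [0.5, 0.0, 0.0, 0.2],
    [0.0, 0.0, 0.9, 0.1],
]
toy_mui = model_utilization_index(toy_activations)

# Toy (MUI, score) pairs (made up) for three hypothetical models.
intercept, slope = fit_log_relation([0.1, 0.2, 0.4], [0.9, 0.7, 0.5])
```

In the toy data, neurons 0, 2, and 3 fire on at least one sample, so three of four neurons count as used. A negative fitted slope corresponds to the intuition stated above, that higher performance goes with lower effort.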

Code Implementations: 1 repo