CL · LG · May 28

Latent Performance Profiling of Large Language Models

arXiv:2605.30018 · 88.6 · Has Code
Predicted impact: top 60% in CL · last 90 days
Originality: Incremental advance
AI Analysis

For researchers and practitioners evaluating LLMs, LPP addresses the limitations of benchmark-centric evaluation by providing intrinsic, interpretable metrics that complement accuracy scores.

The paper proposes Latent Performance Profiling (LPP), a framework that extracts task-agnostic diagnostics from hidden activations and output distributions of LLMs, revealing scale-independent traits and hidden vulnerabilities. Across eight LLMs (0.5B-14B), LPP shows that models with similar benchmark scores can have contrasting latent profiles, enabling deeper interpretability and more reliable model selection.

Large language models (LLMs) frequently achieve impressive scores on standardized benchmarks, yet accuracy alone offers a limited view of their capabilities. Evaluating open-source LLMs through leaderboards faces persistent issues such as data contamination, narrow task scope, and weak alignment with real-world reliability. Benchmark-based evaluations such as MMLU-Pro, BBH, and IFEval primarily capture what a model outputs on fixed test sets, not how it processes information, calibrates uncertainty, or structures internal knowledge. In this article, we advocate for a shift from benchmark-centric evaluation toward a complementary, state-centered intrinsic assessment of LLMs. To this end, we introduce Latent Performance Profiling (LPP), a framework that derives task-agnostic diagnostics from hidden activations and output distributions. LPP defines a set of scalar metrics on a model's latent representations and dynamics, revealing scale-independent traits that enable interpretable comparisons and uncover hidden vulnerabilities. Unlike static accuracy scores, LPP provides stable, architecture-sensitive signatures across models of similar size. With extensive empirical analyses across eight LLMs, spanning a size range of 0.5B-14B, we demonstrate that models with similar benchmark scores can exhibit contrasting latent profiles, such as differences in entropy or adaptability. Guided by these insights, we design synthetic probes for uncertainty and symbolic reasoning that align with intrinsic metrics while decoupling from leaderboard bias. We recommend reporting LPP alongside benchmarks: doing so provides a deeper, interpretable understanding of model behavior and enables more reliable model selection, safety assessment, and evaluation beyond surface-level accuracy.
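To make the idea of a scalar diagnostic on output distributions concrete, here is a minimal sketch of one such quantity: the mean Shannon entropy of next-token distributions. This is an illustrative assumption, not the paper's definition of any LPP metric; the function names (softmax, token_entropy, mean_entropy) and the toy logits are hypothetical, and a real profile would read logits from an actual model rather than hand-written lists.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def token_entropy(logits):
    """Shannon entropy (in nats) of one next-token distribution."""
    probs = softmax(logits)
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_entropy(logit_rows):
    """Average entropy over a sequence of token positions."""
    return sum(token_entropy(row) for row in logit_rows) / len(logit_rows)

# Toy logits (hypothetical): two positions, vocabulary of four tokens.
confident = [[8.0, 0.0, 0.0, 0.0], [0.0, 9.0, 0.0, 0.0]]  # sharply peaked
uncertain = [[1.0, 1.0, 1.0, 1.0], [0.5, 0.4, 0.6, 0.5]]  # nearly flat

print("confident:", mean_entropy(confident))
print("uncertain:", mean_entropy(uncertain))
```

Two models could agree on the top-ranked token at every position, and so score the same on an accuracy benchmark, while differing sharply on a quantity like this. That is the kind of contrast the abstract describes.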

