Coordinates of Capability: A Unified MTMM-Geometric Framework for LLM Evaluation
For researchers and practitioners evaluating LLMs, this work offers a systematic way to separate method variance from true capabilities, with the aim of improving the validity and stability of benchmarks.
This paper addresses the lack of construct validity in LLM evaluation by proposing a unified MTMM-geometric framework that interprets nine metrics as geometric measurements in a latent coordinate space, factorizing model behavior into three orthogonal dimensions. The framework provides a theoretically grounded taxonomy for robust benchmark design.
The evaluation of Large Language Models (LLMs) faces a critical construct-validity problem: fragmented benchmarks and ad hoc metrics frequently conflate method variance, such as prompt sensitivity, with true latent capabilities. Concurrently, emerging research suggests that LLM capabilities and outputs can be modeled as continuous geometric manifolds. In this Systematization of Knowledge (SoK), we bridge these paradigms by proposing a generalized Multi-Trait Multi-Method (MTMM) framework for LLM evaluation. We formalize and unify nine evaluation metrics, including Paraphrase Instability, Drift Score, Overton Width, and Pluralism Score, interpreting them not as isolated scalar values but as geometric measurements within a shared latent coordinate space. This spatial unification factorizes model behavior into three orthogonal latent dimensions: (1) Instability and Sensitivity, (2) Position and Alignment, and (3) Coverage and Expressiveness. By systematically separating task-irrelevant perturbations from true capability spans, the framework provides a theoretically grounded and domain-agnostic taxonomy for robust and empirically stable benchmark design.
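To make the MTMM idea concrete, the following is a minimal sketch of the classical MTMM correlation check on simulated data; it is not the paper's formalization of its nine metrics. It assumes each score is the sum of a trait level, a method bias, and noise. The trait names, the prompt-style "methods", the standard deviations, and the helper names (`simulate_scores`, `mtmm_summary`) are all hypothetical choices made for illustration.

```python
import random
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of floats."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def simulate_scores(n_models, traits, methods, rng,
                    trait_sd=1.0, method_sd=0.5, noise_sd=0.3):
    """Simulate one score per (trait, method) pair for each model.

    Assumed additive model (illustrative only):
        score = trait level + method bias + noise
    """
    scores = {(t, m): [] for t in traits for m in methods}
    for _ in range(n_models):
        trait_level = {t: rng.gauss(0, trait_sd) for t in traits}
        method_bias = {m: rng.gauss(0, method_sd) for m in methods}
        for t in traits:
            for m in methods:
                noise = rng.gauss(0, noise_sd)
                scores[(t, m)].append(trait_level[t] + method_bias[m] + noise)
    return scores


def mtmm_summary(scores):
    """Return (mean convergent r, mean heterotrait-monomethod r).

    Convergent: same trait measured by different methods.
    Heterotrait-monomethod: different traits measured by the same method,
    which reflects shared method variance rather than shared capability.
    """
    convergent, monomethod = [], []
    for (t1, m1), (t2, m2) in combinations(scores, 2):
        r = pearson(scores[(t1, m1)], scores[(t2, m2)])
        if t1 == t2 and m1 != m2:
            convergent.append(r)
        elif t1 != t2 and m1 == m2:
            monomethod.append(r)
    return (sum(convergent) / len(convergent),
            sum(monomethod) / len(monomethod))


# Hypothetical traits and methods, chosen only for this illustration.
traits = ["instability", "position", "coverage"]
methods = ["prompt_a", "prompt_b", "prompt_c"]

rng = random.Random(0)  # fixed seed so the run is repeatable
scores = simulate_scores(200, traits, methods, rng)
convergent_r, monomethod_r = mtmm_summary(scores)
print(f"mean convergent r: {convergent_r:.2f}")
print(f"mean heterotrait-monomethod r: {monomethod_r:.2f}")
```

Under these assumed variances, the convergent correlations should come out clearly higher than the heterotrait-monomethod ones, which is the pattern MTMM analysis treats as evidence that a metric tracks a trait rather than the method. If `method_sd` were raised well above `trait_sd`, the pattern would reverse, which is the conflation of method variance with capability that the abstract describes.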