CL, LG · Mar 11, 2025

EvalTree: Profiling Language Model Weaknesses via Hierarchical Capability Trees

Zhiyuan Zeng, Yizhong Wang, Hannaneh Hajishirzi, Pang Wei Koh

arXiv:2503.08893v2 · 26 citations · h-index: 28

AI Analysis

The paper addresses the need for more actionable and interpretable evaluation methods for language model practitioners, though it appears to be an incremental improvement over existing profiling techniques.

The paper tackles the problem of identifying language model weaknesses by introducing EvalTree, a method that constructs hierarchical capability trees to generate natural language weakness profiles from benchmark performance. The results show that EvalTree outperforms baselines in precision and comprehensiveness on the MATH and WildChat benchmarks, and that weakness-guided data collection based on its profiles improves model performance more effectively than other strategies.

An ideal model evaluation should achieve two goals: identifying where the model fails and providing actionable improvement guidance. Toward these goals for language model (LM) evaluations, we formulate the problem of generating a weakness profile, a set of weaknesses expressed in natural language, given an LM's performance on every individual instance in a benchmark. We introduce a suite of quantitative assessments to compare different weakness profiling methods. We also introduce a weakness profiling method EvalTree. EvalTree constructs a capability tree where each node represents a capability described in natural language and is linked to a subset of benchmark instances that specifically evaluate this capability; it then extracts nodes where the LM performs poorly to generate a weakness profile. On the MATH and WildChat benchmarks, we show that EvalTree outperforms baseline weakness profiling methods by identifying weaknesses more precisely and comprehensively. Weakness profiling further enables weakness-guided data collection, and training data collection guided by EvalTree-identified weaknesses improves LM performance more than other data collection strategies. We also show how EvalTree exposes flaws in Chatbot Arena's human-voter-based evaluation practice. To facilitate future work, we provide an interface that allows practitioners to interactively explore the capability trees built by EvalTree.
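The extraction step described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `CapabilityNode` class, the `extract_weaknesses` function, and the fixed accuracy threshold with a minimum node size are hypothetical stand-ins, and the abstract does not state the criterion EvalTree actually uses to decide that the LM "performs poorly" at a node, nor how the tree itself is built.

```python
from dataclasses import dataclass, field


@dataclass
class CapabilityNode:
    """Hypothetical tree node: a capability plus the benchmark instances testing it."""
    description: str
    instance_ids: list[int]
    children: list["CapabilityNode"] = field(default_factory=list)


def accuracy(node: CapabilityNode, correct: dict[int, bool]) -> float:
    """Fraction of the node's linked instances that the LM answered correctly."""
    return sum(correct[i] for i in node.instance_ids) / len(node.instance_ids)


def extract_weaknesses(root, correct, threshold=0.5, min_size=3):
    """Return descriptions of the topmost nodes whose accuracy is below `threshold`.

    Assumed simplification: a plain cutoff on accuracy, ignoring nodes with
    fewer than `min_size` instances. This is not the paper's stated rule.
    """
    weak = []
    stack = [root]
    while stack:
        node = stack.pop()
        if len(node.instance_ids) >= min_size and accuracy(node, correct) < threshold:
            weak.append(node.description)
            continue  # do not descend: the subtree is covered by this node
        stack.extend(reversed(node.children))  # visit children left to right
    return weak


# Toy per-instance results: instance id -> whether the LM was correct.
correct = {0: True, 1: True, 2: True, 3: False, 4: False, 5: False, 6: True, 7: False}

root = CapabilityNode(
    "Mathematical reasoning",
    list(range(8)),
    children=[
        CapabilityNode("Solving linear equations", [0, 1, 2]),
        CapabilityNode("Reasoning about triangle geometry", [3, 4, 5]),
        CapabilityNode("Counting with combinatorics", [6, 7]),
    ],
)

profile = extract_weaknesses(root, correct)
print(profile)  # ['Reasoning about triangle geometry']
```

In this toy run the root sits exactly at the cutoff, so the search descends and flags only the geometry node; the combinatorics node is skipped because it has too few instances to judge. Returning the topmost failing node keeps the profile short, at the cost of hiding which sub-capabilities within it are the worst.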

