Explainable Benchmarking through the Lense of Concept Learning
This work addresses the tedious and bias-prone manual analysis of benchmark results that researchers and practitioners face, though the contribution is incremental in that it builds on existing concept learning methods.
The paper introduces explainable benchmarking, in which explanations for the performance of systems in a benchmark are generated automatically. It shows that its concept learning approach, PruneCEL, outperforms state-of-the-art concept learners by up to 0.55 points in F1 measure, and that in 80% of the cases in a user study with 41 participants, the majority of participants could accurately predict a system's behavior from the explanations.
Evaluating competing systems in a comparable way, i.e., benchmarking them, is an undeniable pillar of the scientific method. However, system performance is often summarized via a small number of metrics. The analysis of the evaluation details and the derivation of insights for further development or use remains a tedious manual task with often biased results. Thus, this paper argues for a new type of benchmarking, dubbed explainable benchmarking. The aim of explainable benchmarking approaches is to automatically generate explanations for the performance of systems in a benchmark. We provide a first instantiation of this paradigm for knowledge-graph-based question answering systems. We compute explanations by using a novel concept learning approach developed for large knowledge graphs called PruneCEL. Our evaluation shows that PruneCEL outperforms state-of-the-art concept learners on the task of explainable benchmarking by up to 0.55 points in F1 measure. A task-driven user study with 41 participants shows that in 80% of the cases, the majority of participants can accurately predict the behavior of a system based on our explanations. Our code and data are available at https://github.com/dice-group/PruneCEL/tree/K-cap2025
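
To make the F1 measure mentioned above concrete, the following is a minimal sketch of how a candidate concept can be scored against positive and negative examples in a generic concept learning setup. It is an illustration under stated assumptions, not the PruneCEL implementation: the function name f1_score, the question identifiers, and the choice of which questions count as positive examples are all hypothetical.

```python
def f1_score(covered, positives, negatives):
    """Score a candidate concept by the F1 measure.

    covered   -- set of examples that the candidate concept describes
    positives -- set of examples the concept should describe
    negatives -- set of examples the concept should not describe
    All names here are hypothetical and chosen for illustration only.
    """
    true_pos = len(covered & positives)
    false_pos = len(covered & negatives)
    false_neg = len(positives - covered)
    if true_pos == 0:
        # Precision and recall are both zero (or undefined), so F1 is zero.
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


# Assumed toy benchmark: questions a QA system failed on are the positive
# examples, questions it answered correctly are the negative examples.
positives = {"q1", "q2", "q3", "q4"}
negatives = {"q5", "q6", "q7", "q8"}

# Questions described by one candidate explanation (hypothetical).
covered = {"q1", "q2", "q3", "q5"}

score = f1_score(covered, positives, negatives)
print(score)
```

In this toy case the candidate covers three of the four positive examples and one negative example, so precision and recall are both 0.75. An explanation with a higher F1 separates the two groups of questions more cleanly.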