Multi-Dimensional Ability Diagnosis for Machine Learning Algorithms
This work addresses the gap that machine learning practitioners see between real-world performance and scores in standardized evaluations. Its contribution is incremental, since it builds on psychometric theories and existing diagnostic methods.
The paper tackles the problem that traditional evaluation metrics measure machine learning algorithms too coarsely. It proposes Camilla, a task-agnostic framework that defines a multi-dimensional diagnostic metric, Ability, to measure each algorithm's strengths. On four public datasets, Camilla captures these strengths more precisely and outperforms baselines in metric reliability, rank consistency, and rank stability.
Machine learning algorithms have become ubiquitous in a number of applications (e.g., image classification). However, because traditional metrics measure too coarsely (e.g., the Accuracy of each classifier), substantial gaps are usually observed between the real-world performance of these algorithms and their scores in standardized evaluations. In this paper, inspired by the psychometric theories from human measurement, we propose a task-agnostic evaluation framework, Camilla, where a multi-dimensional diagnostic metric, Ability, is defined for collaboratively measuring the multifaceted strength of each machine learning algorithm. Specifically, given the response logs from different algorithms to data samples, we leverage cognitive diagnosis assumptions and neural networks to learn the complex interactions among algorithms, samples, and the skills (explicitly or implicitly pre-defined) of each sample. In this way, both the abilities of each algorithm on multiple skills and some of the sample factors (e.g., sample difficulty) can be simultaneously quantified. We conduct extensive experiments with hundreds of machine learning algorithms on four public datasets, and our experimental results demonstrate that Camilla not only captures the pros and cons of each algorithm more precisely, but also outperforms state-of-the-art baselines on metric reliability, rank consistency, and rank stability.
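To make the idea of diagnosing abilities from response logs concrete, the following is a minimal sketch under simplifying assumptions. It is not the Camilla architecture: the neural interaction layers are replaced by a plain logistic model in which the probability that algorithm i answers sample j correctly is sigmoid(sum of its abilities on the skills of sample j, minus the difficulty of sample j). The function name `fit_abilities`, the hyperparameters, and the toy response logs are all hypothetical and chosen only for illustration.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def fit_abilities(responses, q_matrix, epochs=500, lr=0.5, reg=0.01, seed=0):
    """Fit per-skill abilities and per-sample difficulties by gradient descent.

    responses: (n_algorithms, n_samples) array of 0/1 correctness logs.
    q_matrix:  (n_samples, n_skills) array, 1 where a sample exercises a skill.
    Assumption: a simplified logistic stand-in for a neural diagnosis model.
    """
    rng = np.random.default_rng(seed)
    n_alg, n_samp = responses.shape
    ability = rng.normal(0.0, 0.1, size=(n_alg, q_matrix.shape[1]))
    difficulty = np.zeros(n_samp)
    for _ in range(epochs):
        # Predicted correctness: ability on the sample's skills minus difficulty.
        logits = ability @ q_matrix.T - difficulty
        err = sigmoid(logits) - responses  # gradient of cross-entropy w.r.t. logits
        # Small L2 penalty keeps parameters finite on tiny, separable logs.
        grad_ability = err @ q_matrix / n_samp + reg * ability
        grad_difficulty = -err.sum(axis=0) / n_alg + reg * difficulty
        ability -= lr * grad_ability
        difficulty -= lr * grad_difficulty
    return ability, difficulty


# Toy logs (synthetic): samples 0-3 exercise skill 0, samples 4-7 exercise skill 1.
q_matrix = np.array([[1, 0]] * 4 + [[0, 1]] * 4, dtype=float)
responses = np.array(
    [
        [1, 1, 1, 0, 0, 0, 1, 0],  # algorithm A: stronger on skill 0
        [1, 0, 0, 0, 1, 0, 1, 1],  # algorithm B: stronger on skill 1
    ],
    dtype=float,
)

ability, difficulty = fit_abilities(responses, q_matrix)

print("accuracy per algorithm:", responses.mean(axis=1))
print("ability (rows: algorithms, cols: skills):")
print(ability)
print("sample difficulty:", difficulty)
```

Both toy algorithms have the same overall accuracy, so a single coarse score cannot tell them apart, while the fitted ability matrix separates them by skill. The difficulty vector is estimated jointly, which mirrors the abstract's point that algorithm abilities and sample factors can be quantified at the same time.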