LG, Mar 24, 2015

Comparing published multi-label classifier performance measures to the ones obtained by a simple multi-label baseline classifier

Jean Metz, Newton Spolaôr, Everton A. Cherman, Maria C. Monard

arXiv:1503.06952v1, 5 citations

Originality: Synthesis-oriented

AI Analysis

This work highlights a methodological gap in multi-label learning by showing that many published classifiers fail to outperform a simple baseline, urging the community to adopt baseline comparisons for better evaluation.

The authors propose General_B, a simple multi-label baseline classifier, and compare it to published results on 10 frequently used datasets. Many published results are worse than or equal to those of General_B, reaching up to 43% of the published results for one dataset. Most of the publications give no explanation for such poor performance.

In supervised learning, a simple baseline classifier can be constructed by looking only at the class, i.e., ignoring all other information in the dataset. The single-label learning community frequently uses as a reference the classifier that always predicts the majority class. Although a classifier might perform worse than this simple baseline, such behaviour requires a special explanation. To motivate the community to compare experimental results with those of a multi-label baseline classifier, and to call attention to the need for special explanations when classifiers perform worse than the baseline, in this work we propose the use of General_B, a multi-label baseline classifier. General_B was evaluated against results published in the literature, which were carefully selected using a systematic review process. A considerable number of published results on 10 frequently used datasets were found to be worse than or equal to those obtained by General_B; for one dataset, this share reaches up to 43% of the published results for that dataset. Moreover, although a simple baseline classifier was not considered in these publications, even very poor results were given no special explanation in most of them. We hope that these findings will encourage the multi-label community to adopt a simple baseline classifier, so that further explanations are provided when a classifier performs worse than a baseline.
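To make the idea of a class-only baseline concrete, the following is a minimal Python sketch. The first function is the majority-class baseline described above for single-label learning. The second is one possible multi-label analogue, assumed here purely for illustration: it always predicts the k most frequent training labels, with k taken as the rounded average number of labels per example. This summary does not specify how General_B is built, so the sketch should not be read as the General_B procedure. All function names are hypothetical.

```python
from collections import Counter


def majority_class_baseline(train_labels):
    """Single-label baseline: ignore features, always predict the most frequent class."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return lambda _features=None: majority


def frequent_labels_baseline(train_labelsets):
    """Illustrative multi-label baseline (an assumption, not necessarily General_B).

    Always predicts the k most frequent labels, where k is the rounded
    average number of labels per training example.
    """
    counts = Counter(label for labels in train_labelsets for label in labels)
    cardinality = sum(len(labels) for labels in train_labelsets) / len(train_labelsets)
    k = max(1, round(cardinality))
    # Rank by descending frequency; break ties by label name for determinism.
    ranked = sorted(counts, key=lambda label: (-counts[label], label))
    prediction = frozenset(ranked[:k])
    return lambda _features=None: prediction


def hamming_loss(true_sets, pred_sets, all_labels):
    """Fraction of (example, label) pairs on which prediction and truth disagree."""
    errors = sum(len(set(t) ^ set(p)) for t, p in zip(true_sets, pred_sets))
    return errors / (len(true_sets) * len(all_labels))


if __name__ == "__main__":
    train = [{"x", "y"}, {"x"}, {"x", "y", "z"}, {"y"}]
    predict = frequent_labels_baseline(train)
    test = [{"x"}, {"x", "y"}]
    preds = [predict() for _ in test]
    print(sorted(predict()), hamming_loss(test, preds, ["x", "y", "z"]))
```

A published classifier's score on a given measure can then be compared against the score this kind of baseline obtains on the same test split; a result that is no better than the baseline is the case the abstract argues needs a special explanation.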
