Evaluating Classification Systems Against Soft Labels with Fuzzy Precision and Recall
The work addresses the erroneous interpretations that arise when soft labels are binarized for classification evaluation, an incremental contribution relevant to domains such as sound event detection, where non-binary references are common.
The paper addresses how to evaluate classification systems when the reference labels are non-binary (soft). It introduces a novel method to calculate precision, recall, and F-score without quantizing the data, and demonstrates the method on sound event detection models trained with soft labels.
Classification systems are normally trained by minimizing the cross-entropy between system outputs and reference labels, which makes the Kullback-Leibler divergence a natural choice for measuring how closely the system can follow the data. Precision and recall provide another perspective for measuring the performance of a classification system. Non-binary references can arise from various sources, and it is often beneficial to train on the soft labels instead of the binarized data. However, the existing definitions of precision and recall require binary reference labels, and binarizing the data can cause erroneous interpretations. We present a novel method to calculate precision, recall, and F-score without quantizing the data. The proposed metrics extend the well-established metrics, as the definitions coincide when used with binary labels. To understand the behavior of the metrics, we show simple example cases and an evaluation of different sound event detection models trained on real data with soft labels.
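To make the idea concrete, the following is a minimal sketch of one possible fuzzy generalization of precision, recall, and F-score. It assumes the fuzzy-set intersection (element-wise minimum) as the soft counterpart of a true positive; this choice is an assumption for illustration and is not necessarily the exact definition used in the paper. The function name `fuzzy_precision_recall_f` and the `eps` parameter are hypothetical. With binary inputs the minimum reduces to the usual true-positive count, so the values coincide with the standard metrics.

```python
import numpy as np


def fuzzy_precision_recall_f(reference, output, eps=1e-12):
    """Precision, recall, and F-score for values in [0, 1].

    ASSUMPTION: the soft true-positive mass is the element-wise minimum
    (fuzzy intersection) of reference and output. This is an illustrative
    choice, not necessarily the paper's definition.
    """
    reference = np.asarray(reference, dtype=float)
    output = np.asarray(output, dtype=float)

    # Soft counterpart of the true-positive count.
    overlap = np.minimum(reference, output).sum()

    # eps guards against division by zero when a vector is all zeros.
    precision = overlap / max(output.sum(), eps)
    recall = overlap / max(reference.sum(), eps)
    f_score = 2 * precision * recall / max(precision + recall, eps)
    return precision, recall, f_score


# Binary labels: one true positive, one false positive, one false negative,
# so the result matches the standard precision and recall (both 0.5).
print(fuzzy_precision_recall_f([1, 0, 1, 0], [1, 1, 0, 0]))

# Soft labels: the output follows the reference exactly, yet thresholding
# the reference at 0.5 would leave no positive labels at all.
soft_reference = [0.4, 0.4, 0.4]
print(fuzzy_precision_recall_f(soft_reference, soft_reference))
```

The second call illustrates the kind of misinterpretation that binarization can cause: a system that reproduces the soft reference perfectly would have no positive reference labels to be scored against after thresholding, while the soft computation credits it fully.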