Correcting Performance Estimation Bias in Imbalanced Classification with Minority Subconcepts
For practitioners evaluating classifiers on imbalanced data with hidden subpopulations, this work offers a practical method to correct evaluation bias without requiring subconcept labels at test time.
The paper addresses bias in class-level evaluation metrics for imbalanced classification when minority subconcepts exist within a class. It proposes predicted-weighted balanced accuracy (pBA), which weights each evaluation example by its expected utility under the posterior probabilities predicted by a multiclass subconcept model, and shows that pBA provides more stable and interpretable assessments than unweighted metrics on tabular, medical-imaging, and text datasets.
Class-level evaluation can conceal substantial performance disparities across subconcepts within the same class, causing models that perform well on average to fail on specific subpopulations. Prior work has shown that common evaluation measures for imbalanced classification are biased toward larger minority subconcepts and that utility-based reweighting using true subconcept labels can mitigate this bias; however, such labels are rarely available at test time. We introduce a practical utility-weighted evaluation that replaces unavailable subconcept labels with predicted posterior probabilities from a multiclass subconcept model. Evaluation weights are defined as the expected utility under this posterior, yielding a soft, uncertainty-aware metric we call predicted-weighted balanced accuracy (pBA). Experiments on tabular benchmarks as well as medical-imaging and text datasets show that unweighted scores can be misleading under within-class heterogeneity, while pBA provides more stable and interpretable assessments when subconcept distributions are uneven but not pathological. Our code is available at: https://anonymous.4open.science/r/correcting-bias-imbalance-9C6C/.
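To make the weighting idea concrete, the following is a minimal sketch of how an expected-utility weight and a weighted balanced accuracy could be computed. It is not the authors' implementation. The function names, the toy data, and the utility values (chosen here to up-weight the smaller minority subconcept) are all assumptions made for illustration; the abstract only states that weights are the expected utility under the predicted subconcept posterior.

```python
import numpy as np


def expected_utility_weights(posteriors, utilities):
    """Per-example weight w_i = sum_k p(k | x_i) * u_k.

    posteriors: (n_samples, n_subconcepts) predicted subconcept probabilities.
    utilities:  (n_subconcepts,) utility per subconcept (hypothetical values).
    """
    posteriors = np.asarray(posteriors, dtype=float)
    utilities = np.asarray(utilities, dtype=float)
    return posteriors @ utilities


def weighted_balanced_accuracy(y_true, y_pred, weights):
    """Mean over classes of the weighted recall of each class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    weights = np.asarray(weights, dtype=float)
    recalls = []
    for c in np.unique(y_true):
        mask = y_true == c
        correct = (y_pred[mask] == c).astype(float)
        recalls.append(np.sum(weights[mask] * correct) / np.sum(weights[mask]))
    return float(np.mean(recalls))


# Toy data (assumed): class 0 is the majority class; class 1 contains a
# larger subconcept (3 examples) and a smaller one (1 example).
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
# The classifier misses the single example from the small subconcept.
y_pred = np.array([0, 0, 0, 0, 1, 1, 1, 0])

# Predicted subconcept posteriors; columns are
# [majority, large minority subconcept, small minority subconcept].
posteriors = np.array(
    [[1.0, 0.0, 0.0]] * 4
    + [[0.0, 0.9, 0.1]] * 3
    + [[0.0, 0.2, 0.8]]
)
# Hypothetical utilities that up-weight the small minority subconcept.
utilities = np.array([1.0, 1.0, 3.0])

weights = expected_utility_weights(posteriors, utilities)
ba_unweighted = weighted_balanced_accuracy(y_true, y_pred, np.ones(len(y_true)))
pba = weighted_balanced_accuracy(y_true, y_pred, weights)
```

With these toy numbers the unweighted balanced accuracy is 0.875, while the weighted score is about 0.79, because the missed example from the small subconcept carries more weight. No subconcept labels are needed at evaluation time in this sketch, only the predicted posteriors.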