LG, AI · Feb 19, 2024

Imbalance in Regression Datasets

Daniel Kowatsch, Nicolas M. Müller, Kilian Tscharke, Philip Sperl, Konstantin Bötinger

arXiv:2402.11963v1 · 4.63 citations · h-index: 8

Originality: Highly original

AI Analysis

The paper addresses a foundational issue in machine learning for regression tasks and could improve model reliability across domains, but it is incremental in that it extends known concepts from classification.

The paper identifies imbalance in regression datasets as a significant but overlooked problem: under- and over-representation in the target distribution causes regressors to degenerate to naive models. It proposes a first definition of imbalance in regression, formulated as a generalisation of the imbalance measure commonly used in classification.

For classification, the problem of class imbalance is well known and has been extensively studied. In this paper, we argue that imbalance in regression is an equally important problem which has so far been overlooked: Due to under- and over-representations in a data set's target distribution, regressors are prone to degenerate to naive models, systematically neglecting uncommon training data and over-representing targets seen often during training. We analyse this problem theoretically and use resulting insights to develop a first definition of imbalance in regression, which we show to be a generalisation of the commonly employed imbalance measure in classification. With this, we hope to turn the spotlight on the overlooked problem of imbalance in regression and to provide common ground for future research.
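The degeneration described in the abstract can be illustrated with a small sketch. It is not the paper's definition or method. It only shows, on hypothetical toy targets, that the constant predictor minimising mean squared error sits close to the over-represented targets, so the overall error looks small while the error on rare targets is large. The function name `naive_model_errors`, the `rare_threshold` parameter, and the toy data are assumptions made for illustration.

```python
import numpy as np


def naive_model_errors(y, rare_threshold):
    """Compare a constant mean predictor's error on common vs. rare targets.

    `rare_threshold` is a hypothetical cut-off: targets at or above it
    are treated as the under-represented region.
    """
    y = np.asarray(y, dtype=float)
    prediction = y.mean()  # the constant that minimises mean squared error
    squared_error = (y - prediction) ** 2
    rare = y >= rare_threshold
    return {
        "prediction": prediction,
        "mse_all": squared_error.mean(),
        "mse_common": squared_error[~rare].mean(),
        "mse_rare": squared_error[rare].mean(),
    }


# Hypothetical toy targets: 95 samples at 0.0 and 5 samples at 10.0.
y = np.concatenate([np.zeros(95), np.full(5, 10.0)])
errors = naive_model_errors(y, rare_threshold=5.0)
print(errors)
```

On this toy data the constant prediction is 0.5, which gives a squared error of 0.25 on each common target and 90.25 on each rare one. The aggregate error of 4.75 therefore hides how poorly the rare region is served.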
