LG | Oct 13, 2023

Target Variable Engineering

arXiv:2310.09440v1 | 1 citation | h-index: 1

Originality: Synthesis-oriented

AI Analysis

This work addresses efficiency and replicability issues in ML pipelines for researchers and practitioners, though it is incremental as it systematically compares existing methods.

The study investigated how formulating a target variable as numeric (regression) versus binarized (classification) affects performance in ML pipelines, finding that regression requires significantly more computational effort to converge and is more sensitive to randomness and heuristic choices.

How does the formulation of a target variable affect performance within the ML pipeline? The experiments in this study examine numeric targets that have been binarized by comparing against a threshold. We compare the predictive performance of regression models trained to predict the numeric targets with that of classifiers trained to predict their binarized counterparts. Specifically, we make this comparison at every point of a randomized hyperparameter optimization search to understand how the computational resource budget affects the tradeoff between the two. We find that regression requires significantly more computational effort to converge on optimal performance, and is more sensitive to both randomness and heuristic choices in the training process. Although classification can and does benefit from systematic hyperparameter tuning and model selection, the improvements are much smaller than for regression. This work comprises the first systematic comparison of regression and classification within the framework of computational resource requirements. Our findings contribute to calls for greater replicability and efficiency within the ML pipeline for the sake of building more sustainable and robust AI systems.
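The comparison described in the abstract can be illustrated with a minimal sketch. This is not the authors' code or experimental setup: the synthetic data, the k-nearest-neighbour model, the threshold value, and every function name below are assumptions made for illustration. The sketch also assumes that regression predictions are scored by thresholding them and measuring accuracy against the binarized labels, and it reuses one sampled hyperparameter for both formulations, which is a simplification. It records the best validation score found so far after each trial of a randomized search, which is the kind of budget-versus-performance curve the abstract refers to.

```python
import random

THRESHOLD = 5.0  # assumed cutoff used to binarize the numeric target


def make_data(n, rng):
    """Hypothetical synthetic data: one feature and a noisy numeric target."""
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    ys = [x + rng.gauss(0.0, 1.5) for x in xs]
    return xs, ys


def knn_mean(train_x, train_t, x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    return sum(train_t[i] for i in nearest) / k


def accuracy(pred, truth):
    """Fraction of predictions that match the true binary labels."""
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)


def search_curves(n_trials=20, seed=0):
    """Return best-so-far validation accuracy per trial for both formulations."""
    rng = random.Random(seed)
    train_x, train_y = make_data(200, rng)
    val_x, val_y = make_data(100, rng)
    # Binarize the numeric targets by comparing against the threshold.
    train_b = [int(y >= THRESHOLD) for y in train_y]
    val_b = [int(y >= THRESHOLD) for y in val_y]

    best_reg, best_clf = 0.0, 0.0
    reg_curve, clf_curve = [], []
    for _ in range(n_trials):
        k = rng.randint(1, 30)  # one randomly sampled configuration
        # Regression formulation: predict the numeric target, then threshold it.
        reg_pred = [int(knn_mean(train_x, train_y, x, k) >= THRESHOLD) for x in val_x]
        # Classification formulation: majority vote over binarized neighbours.
        clf_pred = [int(knn_mean(train_x, train_b, x, k) >= 0.5) for x in val_x]
        best_reg = max(best_reg, accuracy(reg_pred, val_b))
        best_clf = max(best_clf, accuracy(clf_pred, val_b))
        reg_curve.append(best_reg)
        clf_curve.append(best_clf)
    return reg_curve, clf_curve


if __name__ == "__main__":
    reg_curve, clf_curve = search_curves()
    for trial, (r, c) in enumerate(zip(reg_curve, clf_curve), start=1):
        print(f"trial {trial:2d}  regression={r:.3f}  classification={c:.3f}")
```

Repeating the search over several seeds and comparing how quickly each curve flattens would give a rough picture of sensitivity to randomness. Any pattern seen on this toy data should not be read as reproducing the paper's findings.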
