LG · Dec 28, 2023

Encoding categorical data: Is there yet anything 'hotter' than one-hot encoding?

arXiv:2312.16930v18.844 citations · h-index: 1

Originality: Synthesis-oriented

AI Analysis

This research addresses a critical preprocessing issue for data scientists and machine learning practitioners: it provides statistically robust evidence to guide encoding choices. The contribution is incremental, since it refines existing claims rather than introducing new methods.

The study evaluates categorical encoding methods through a comprehensive analysis of a large sample of classification datasets. It finds that one-hot and Helmert coding outperform target-based encoders in multiclass tasks, with no significant differences among encoders in binary tasks.

Categorical features are present in about 40% of real-world problems, highlighting the crucial role of encoding as a preprocessing component. Some recent studies have reported benefits of various target-based encoders over classical target-agnostic approaches. However, these claims are not supported by any statistical analysis, and are based on a single dataset or on a very small and heterogeneous sample of datasets. The present study explores the encoding effects in an exhaustive sample of classification problems from the OpenML repository. We fitted linear mixed-effects models to the experimental data, treating task ID as a random effect, and the encoding scheme and the various characteristics of categorical features as fixed effects. We found that in multiclass tasks, one-hot encoding and Helmert contrast coding outperform target-based encoders. In binary tasks, there were no significant differences across the encoding schemes; however, one-hot encoding demonstrated a marginally positive effect on the outcome. Importantly, we found no significant interactions between the encoding schemes and the characteristics of categorical features. This suggests that our findings are generalizable to a wide variety of problems across domains.
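To make the three encoder families named in the abstract concrete, here is a minimal pure-Python sketch. It is illustrative only and is not the paper's experimental code. The function names (`one_hot`, `helmert`, `target_mean`), the toy data, and the `smoothing` parameter are assumptions introduced here. The Helmert variant shown follows one common unnormalized convention, and the target encoder is a simple smoothed category mean for a binary target; the abstract does not say which target-based encoders were compared.

```python
from collections import defaultdict


def one_hot(values):
    """Target-agnostic: one 0/1 indicator column per category level."""
    levels = sorted(set(values))
    index = {level: i for i, level in enumerate(levels)}
    rows = []
    for v in values:
        row = [0] * len(levels)
        row[index[v]] = 1
        rows.append(row)
    return levels, rows


def helmert(values):
    """Target-agnostic: unnormalized Helmert contrasts, k levels -> k-1 columns.

    Assumed convention: in column j, levels 0..j get -1, level j+1 gets j+1,
    and all later levels get 0.
    """
    levels = sorted(set(values))
    k = len(levels)
    contrast = []
    for i in range(k):
        row = []
        for j in range(k - 1):
            if i <= j:
                row.append(-1)
            elif i == j + 1:
                row.append(j + 1)
            else:
                row.append(0)
        contrast.append(row)
    index = {level: i for i, level in enumerate(levels)}
    return levels, [contrast[index[v]] for v in values]


def target_mean(values, targets, smoothing=1.0):
    """Target-based: replace each category by a smoothed mean of the target.

    `smoothing` (hypothetical parameter) pulls rare categories toward the
    global mean of the target.
    """
    prior = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for v, t in zip(values, targets):
        sums[v] += t
        counts[v] += 1
    encoded = {
        v: (sums[v] + smoothing * prior) / (counts[v] + smoothing)
        for v in counts
    }
    return [encoded[v] for v in values]


# Toy data (made up for illustration): one categorical feature, binary target.
colors = ["red", "green", "blue", "green", "red", "red"]
labels = [1, 0, 1, 0, 1, 0]

oh_levels, oh_rows = one_hot(colors)
hm_levels, hm_rows = helmert(colors)
te_values = target_mean(colors, labels)
```

Note that a target-based encoder reads the labels, so in practice it should be fitted on training data only and then applied to held-out data; the two target-agnostic encoders have no such dependency.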

