ML | LG | May 16, 2020

Transforming variables to central normality

arXiv:2005.07946v2 | 80 citations
AI Analysis

This paper addresses a data-preprocessing challenge for statisticians and data scientists who work with skewed real-world data, and offers an incremental improvement over existing techniques.

The paper tackles the problem of transforming skewed numerical variables to normality, a setting in which the standard fitting methods are sensitive to outliers. It proposes a robust modification of the Box-Cox and Yeo-Johnson transformations that makes the central part of the data approximately normal while allowing outliers to deviate.

Many real data sets contain numerical features (variables) whose distribution is far from normal (Gaussian). Instead, their distribution is often skewed. In order to handle such data it is customary to preprocess the variables to make them more normal. The Box-Cox and Yeo-Johnson transformations are well-known tools for this. However, the standard maximum likelihood estimator of their transformation parameter is highly sensitive to outliers, and will often try to move outliers inward at the expense of the normality of the central part of the data. We propose a modification of these transformations as well as an estimator of the transformation parameter that is robust to outliers, so the transformed data can be approximately normal in the center and a few outliers may deviate from it. The proposed method compares favorably to existing techniques in an extensive simulation study and on real data.
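
The outlier sensitivity described in the abstract can be illustrated with a minimal sketch. It covers only the classical Box-Cox transformation, which maps x > 0 to (x^λ - 1)/λ for λ ≠ 0 and to log x for λ = 0, fitted by ordinary maximum likelihood. It is not the robust estimator proposed in the paper. The function names (`box_cox`, `profile_loglik`, `ml_lambda`), the grid search, and the synthetic data are all assumptions made for illustration.

```python
import math
from statistics import NormalDist


def box_cox(x, lam):
    """Classical Box-Cox transform of a single positive value x."""
    if abs(lam) < 1e-12:
        return math.log(x)
    return (x ** lam - 1.0) / lam


def profile_loglik(data, lam):
    """Box-Cox profile log-likelihood under a normal model (constants dropped)."""
    n = len(data)
    y = [box_cox(v, lam) for v in data]
    mean = sum(y) / n
    var = sum((v - mean) ** 2 for v in y) / n
    jacobian = (lam - 1.0) * sum(math.log(v) for v in data)
    return -0.5 * n * math.log(var) + jacobian


def ml_lambda(data, grid=None):
    """Hypothetical helper: ML estimate of lambda by a simple grid search."""
    if grid is None:
        grid = [i / 100.0 for i in range(-200, 301)]  # -2.00 .. 3.00
    return max(grid, key=lambda lam: profile_loglik(data, lam))


# Idealised clean sample: log(x) follows normal quantiles exactly,
# so the transformation that normalises the data is lambda = 0 (the log).
n = 500
std_normal = NormalDist()
clean = [math.exp(0.5 * std_normal.inv_cdf((i + 0.5) / n)) for i in range(n)]

# Contaminate with 2% outliers very close to zero (assumed values).
contaminated = clean + [0.001] * 10

lam_clean = ml_lambda(clean)
lam_contaminated = ml_lambda(contaminated)

print("lambda on clean data:       ", lam_clean)
print("lambda on contaminated data:", lam_contaminated)
```

On the clean sample the estimate stays near 0. With the outliers added, the estimate moves well away from 0, because a larger λ pulls the outlying points inward. The cost is that the central 98% of the data is no longer transformed by the logarithm that would make it normal.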

