LG · Nov 8, 2021

A Novel Data Pre-processing Technique: Making Data Mining Robust to Different Units and Scales of Measurement

Arbind Agrahari Baniya, Sunil Aryal, Santosh KC

arXiv:2111.04253v1 · 1.61 citations

Originality: Incremental advance

AI Analysis

The paper addresses a practical issue for data miners by providing a robust pre-processing method, though it appears incremental because it builds on existing rank transformation approaches.

The paper tackles the problem of data mining algorithms being sensitive to units and scales by proposing ARES, a pre-processing technique based on average ranks over sub-samples, which results in more consistent and often better outcomes compared to min-max normalization and rank transformation across various datasets and algorithms.

Many existing data mining algorithms use feature values directly in their models, making them sensitive to the units and scales used to measure or represent data. Pre-processing based on rank transformation has been suggested as a potential solution to this issue. However, the data resulting from rank transformation is uniformly distributed, which may not be very useful in many data mining applications. In this paper, we present a better and more effective alternative based on ranks over multiple sub-samples of data. We call the proposed pre-processing technique ARES: Average Rank over an Ensemble of Sub-samples. Our empirical results with widely used data mining algorithms for classification and anomaly detection on a wide range of data sets suggest that ARES yields more consistent task-specific outcomes across various algorithms and data sets. In addition, it yields better or competitive outcomes most of the time compared to the most widely used min-max normalisation and the traditional rank transformation.
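
The abstract gives only the high-level idea of averaging ranks over sub-samples, so the following is a minimal sketch of how that idea might look for a single feature, not the authors' implementation. The function names, the parameters `n_subsamples`, `subsample_size`, and `seed`, the sampling without replacement, and the "count of values less than or equal to x" rank convention are all assumptions made for illustration. A classical rank transformation is included for comparison.

```python
import numpy as np


def rank_transform(train, x):
    """Classical rank transformation of one feature: the fraction of
    training values that are <= each query value."""
    sorted_train = np.sort(np.asarray(train, dtype=float))
    ranks = np.searchsorted(sorted_train, np.asarray(x, dtype=float), side="right")
    return ranks / sorted_train.size


def ares_transform(train, x, n_subsamples=10, subsample_size=8, seed=0):
    """Average rank of each query value over random sub-samples of one feature.

    n_subsamples, subsample_size and the <= rank convention are assumptions
    made for this sketch, not values taken from the paper.
    """
    train = np.asarray(train, dtype=float)
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng(seed)
    size = min(subsample_size, train.size)
    total = np.zeros(x.shape, dtype=float)
    for _ in range(n_subsamples):
        # Draw a sub-sample without replacement and rank x within it.
        sub = np.sort(rng.choice(train, size=size, replace=False))
        total += np.searchsorted(sub, x, side="right")
    # Normalise the summed ranks to the range [0, 1].
    return total / (n_subsamples * size)


# Hypothetical feature measured on a skewed scale.
train = np.array([1, 2, 4, 8, 16, 32, 64, 128, 256, 512], dtype=float)
original = ares_transform(train, train)
# Same feature after a change of unit and after a log re-scaling.
rescaled = ares_transform(train * 1000.0, train * 1000.0)
logged = ares_transform(np.log(train), np.log(train))
```

Because only the ordering of values enters the computation, a strictly increasing re-expression of the feature (a change of unit, or a log scale) should leave the output of this sketch unchanged when the same seed is used, which is the kind of robustness the abstract describes.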
