LG IR ML
Aug 13, 2016

An approach to dealing with missing values in heterogeneous data using k-nearest neighbors

Davi E. N. Frossard, Igor O. Nunes, Renato A. Krohling

arXiv:1608.04037v11.04 citations

Originality: Synthesis-oriented

AI Analysis

The paper addresses a common issue in real-world data analysis for researchers and practitioners, but its contribution is incremental: it adapts an existing method to handle specific data types.

The paper tackles the problem of missing values in heterogeneous data, which can bias results in techniques like clustering and neural networks, by proposing a k-nearest neighbors imputation method that handles crisp, interval, and fuzzy data, with promising numerical results on several datasets.

Techniques such as clustering, neural networks and decision making usually rely on algorithms that are not well suited to deal with missing values. However, real-world data frequently contain such values. The simplest solutions are either to substitute a best-guess value for each missing value or to disregard the missing values entirely. Unfortunately, both approaches can lead to biased results. In this paper, we propose a technique for dealing with missing values in heterogeneous data using imputation based on the k-nearest neighbors algorithm. It can handle real (which we refer to as crisp henceforward), interval and fuzzy data. The effectiveness of the algorithm is tested on several datasets and the numerical results are promising.
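
To make the idea of k-nearest neighbors imputation concrete, here is a minimal sketch for crisp (real-valued) data only. It is not the authors' algorithm: the abstract does not specify their distance measure, their aggregation rule, or how interval and fuzzy values are compared. The sketch assumes a Euclidean distance computed over the features two rows share and an unweighted mean of the neighbors' values. The names `distance` and `knn_impute` are hypothetical, and missing entries are represented as `None`.

```python
import math


def distance(a, b):
    """Euclidean distance over features observed in both rows.

    Assumption: the squared sum is rescaled by (total / shared) features so
    that rows with few shared features are not unfairly favored. Returns
    infinity when the two rows share no observed feature.
    """
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return math.inf
    squared = sum((x - y) ** 2 for x, y in shared)
    return math.sqrt(squared * len(a) / len(shared))


def knn_impute(rows, k=2):
    """Return a copy of rows with each None replaced by the mean of that
    feature over the k nearest rows in which the feature is observed."""
    imputed = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: every other row where feature j is observed.
            candidates = sorted(
                (distance(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            )
            neighbors = [v for d, v in candidates[:k] if d < math.inf]
            if neighbors:  # leave the gap if no usable neighbor exists
                imputed[i][j] = sum(neighbors) / len(neighbors)
    return imputed


data = [
    [1.0, 2.0, 3.0],
    [1.1, 2.1, None],   # third feature is missing
    [0.9, 1.9, 2.8],
    [8.0, 9.0, 10.0],   # far away, should not influence the estimate
]
result = knn_impute(data, k=2)
```

In this toy dataset the second row is closest to the first and third rows, so its missing value is filled with the mean of their third features (about 2.9), while the distant fourth row is ignored. Extending the sketch to interval or fuzzy data would require a distance defined for those types, which the abstract does not describe.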
