ML LG · Feb 10, 2020

Missing Data Imputation using Optimal Transport

Boris Muzellec, Julie Josse, Claire Boyer, Marco Cuturi

arXiv:2002.03860v3 · 27.4181 citations · Has Code

Originality: Incremental advance

AI Analysis

This paper addresses a common data quality problem in real-world ML applications, though it appears incremental because it builds on existing optimal transport frameworks.

The paper tackles missing data imputation by using optimal transport distances to enforce distributional consistency between data batches, achieving performance that matches or exceeds state-of-the-art methods across various missing data scenarios, including high missing percentages.

Missing data is a crucial issue when applying machine learning algorithms to real-world datasets. Starting from the simple assumption that two batches extracted randomly from the same dataset should share the same distribution, we leverage optimal transport distances to quantify that criterion and turn it into a loss function to impute missing data values. We propose practical methods to minimize these losses using end-to-end learning, that can exploit or not parametric assumptions on the underlying distributions of values. We evaluate our methods on datasets from the UCI repository, in MCAR, MAR and MNAR settings. These experiments show that OT-based methods match or out-perform state-of-the-art imputation methods, even for high percentages of missing values.
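The core idea in the abstract (two random batches from the same dataset should look alike, with an optimal transport distance measuring how alike they are) can be illustrated with a small sketch. This is not the authors' implementation: it uses a pure-Python, log-domain Sinkhorn iteration to compute a Sinkhorn divergence, and it only evaluates the loss for two candidate imputations rather than minimizing it by gradient descent as the paper's end-to-end methods do. The toy data, the function names (`entropic_ot`, `sinkhorn_divergence`, `impute`), and the settings `eps=1.0` and `n_iter=200` are all assumptions made for illustration.

```python
import math
import random

def sq_dist(x, y):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v)))."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))

def entropic_ot(X, Y, eps=1.0, n_iter=200):
    """Entropy-regularized OT cost between two uniformly weighted
    point clouds, estimated from the Sinkhorn dual potentials."""
    n, m = len(X), len(Y)
    C = [[sq_dist(x, y) for y in Y] for x in X]
    log_a, log_b = -math.log(n), -math.log(m)
    f, g = [0.0] * n, [0.0] * m
    for _ in range(n_iter):
        f = [-eps * log_sum_exp([log_b + (g[j] - C[i][j]) / eps
                                 for j in range(m)]) for i in range(n)]
        g = [-eps * log_sum_exp([log_a + (f[i] - C[i][j]) / eps
                                 for i in range(n)]) for j in range(m)]
    return sum(f) / n + sum(g) / m

def sinkhorn_divergence(X, Y, eps=1.0):
    """Debiased entropic OT: OT(X,Y) - (OT(X,X) + OT(Y,Y)) / 2."""
    return entropic_ot(X, Y, eps) - 0.5 * (
        entropic_ot(X, X, eps) + entropic_ot(Y, Y, eps))

def impute(batch, rows, fill):
    """Copy the batch and overwrite feature 1 of the given rows."""
    out = [row[:] for row in batch]
    for i in rows:
        out[i][1] = fill(out[i])
    return out

# Toy data (assumption): two strongly correlated features.
random.seed(0)

def sample_row():
    x1 = random.uniform(-1.0, 1.0)
    return [x1, x1 + random.gauss(0.0, 0.1)]

complete_batch = [sample_row() for _ in range(20)]
other_batch = [sample_row() for _ in range(20)]
missing_rows = range(10)  # pretend feature 1 is missing in these rows

# Candidate imputations: a far-off constant vs. a value using the correlation.
bad = impute(other_batch, missing_rows, lambda row: 5.0)
good = impute(other_batch, missing_rows, lambda row: row[0])

loss_bad = sinkhorn_divergence(bad, complete_batch)
loss_good = sinkhorn_divergence(good, complete_batch)
print("loss with constant fill:  ", loss_bad)
print("loss with correlated fill:", loss_good)
```

The implausible constant fill pushes half of the batch away from the rest of the data, so its divergence from the complete batch is larger than that of the fill that respects the feature correlation. Treating the imputed entries as free parameters and descending this loss is the step that this sketch leaves out.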
