Robust Variational Autoencoders for Outlier Detection and Repair of Mixed-Type Data
The paper addresses the practical problem of identifying and repairing corrupted cells in mixed-type datasets, an incremental step beyond traditional row-level outlier detection.
The paper tackles unsupervised cell outlier detection and repair in mixed-type tabular data by introducing the Robust Variational Autoencoder (RVAE), which learns the joint distribution of clean data to identify and impute outlier cells, and reports better performance than several state-of-the-art methods.
We focus on the problem of unsupervised cell outlier detection and repair in mixed-type tabular data. Traditional methods are concerned only with detecting which rows in the dataset are outliers. However, identifying which cells are corrupted in a specific row is an important problem in practice, and the very first step towards repairing them. We introduce the Robust Variational Autoencoder (RVAE), a deep generative model that learns the joint distribution of the clean data while identifying the outlier cells, allowing their imputation (repair). RVAE explicitly learns the probability of each cell being an outlier, balancing different likelihood models in the row outlier score, making the method suitable for outlier detection in mixed-type datasets. We show experimentally that RVAE not only performs better than several state-of-the-art methods in cell outlier detection and repair for tabular data, but is also robust to the initial hyper-parameter selection.
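The abstract does not spell out how a per-cell outlier probability is obtained, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes each cell is modelled as a two-component mixture: a "clean" likelihood (standing in for a decoder output) and a broad "outlier" likelihood, with a prior clean weight `alpha`. All names and values (`clean_probability`, `alpha`, the decoder outputs, the outlier standard deviation of 3.0) are hypothetical and chosen for illustration.

```python
import numpy as np

def gaussian_logpdf(x, mean, std):
    """Log-density of a univariate Gaussian, evaluated elementwise."""
    return -0.5 * np.log(2.0 * np.pi * std**2) - 0.5 * ((x - mean) / std) ** 2

def clean_probability(log_p_clean, log_p_outlier, alpha=0.95):
    """Posterior probability that a cell is clean under an ASSUMED
    two-component mixture: alpha * p_clean + (1 - alpha) * p_outlier."""
    log_odds = (np.log(alpha) + log_p_clean) - (np.log1p(-alpha) + log_p_outlier)
    return 1.0 / (1.0 + np.exp(-log_odds))

# Hypothetical row: two standardized numeric cells and one categorical cell.
x_num = np.array([0.1, 6.0])        # the second value looks corrupted
dec_mean = np.array([0.0, 0.2])     # stand-in for decoder means
dec_std = np.array([0.5, 0.5])      # stand-in for decoder std devs

log_clean_num = gaussian_logpdf(x_num, dec_mean, dec_std)
log_out_num = gaussian_logpdf(x_num, 0.0, 3.0)   # assumed broad outlier Gaussian

cat_probs = np.array([0.9, 0.05, 0.05])  # stand-in for decoder softmax output
observed = 0                              # observed category index
log_clean_cat = np.log(cat_probs[observed])
log_out_cat = -np.log(len(cat_probs))     # assumed uniform outlier distribution

log_clean = np.append(log_clean_num, log_clean_cat)
log_out = np.append(log_out_num, log_out_cat)

pi = clean_probability(log_clean, log_out)   # per-cell probability of being clean
cell_outlier_score = 1.0 - pi
row_outlier_score = cell_outlier_score.mean()

# Repair: replace numeric cells flagged as outliers with the decoder mean.
repaired_num = np.where(pi[:2] < 0.5, dec_mean, x_num)
```

Because every cell contributes a probability in [0, 1] rather than a raw log-likelihood, numeric and categorical cells enter the row score on the same scale in this sketch, which is one plausible reading of "balancing different likelihood models in the row outlier score".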