LGJul 15, 2019
Robust Variational Autoencoders for Outlier Detection and Repair of Mixed-Type DataSimão Eduardo, Alfredo Nazábal, Christopher K. I. Williams et al.
We focus on the problem of unsupervised cell outlier detection and repair in mixed-type tabular data. Traditional methods are concerned only with detecting which rows in the dataset are outliers. However, identifying which cells are corrupted in a specific row is an important problem in practice, and the very first step towards repairing them. We introduce the Robust Variational Autoencoder (RVAE), a deep generative model that learns the joint distribution of the clean data while identifying the outlier cells, allowing their imputation (repair). RVAE explicitly learns the probability of each cell being an outlier, balancing different likelihood models in the row outlier score, making the method suitable for outlier detection in mixed-type datasets. We show experimentally that not only RVAE performs better than several state-of-the-art methods in cell outlier detection and repair for tabular data, but also that is robust against the initial hyper-parameter selection.
LGJan 11, 2018
Autoencoders and Probabilistic Inference with Missing Data: An Exact Solution for The Factor Analysis CaseChristopher K. I. Williams, Charlie Nash, Alfredo Nazábal
Latent variable models can be used to probabilistically "fill-in" missing data entries. The variational autoencoder architecture (Kingma and Welling, 2014; Rezende et al., 2014) includes a "recognition" or "encoder" network that infers the latent variables given the data variables. However, it is not clear how to handle missing data variables in this network. The factor analysis (FA) model is a basic autoencoder, using linear encoder and decoder networks. We show how to calculate exactly the latent posterior distribution for the factor analysis (FA) model in the presence of missing data, and note that this solution implies that a different encoder network is required for each pattern of missingness. We also discuss various approximations to the exact solution. Experiments compare the effectiveness of various approaches to filling in the missing data.