Disentangled VAE Representations for Multi-Aspect and Missing Data
This work addresses the challenge of handling incomplete multi-view or multi-modal data in machine learning applications, though the contribution appears incremental because it builds on existing VAE frameworks.
The paper tackles the problem of conditional modeling and sampling in multi-aspect data with missing observations by developing factVAE, a deep generative model whose effectiveness is demonstrated on real-world datasets such as motion capture poses and facial images.
Many problems in machine learning and related application areas are fundamentally variants of conditional modeling and sampling across multi-aspect data, whether multi-view, multi-modal, or simply multi-group. For example, one may wish to sample from the distribution of English sentences conditioned on a given French sentence, or to sample audio waveforms conditioned on a given piece of text. Central to many of these problems is the issue of missing data: we can observe many English, French, or German sentences individually, but only occasionally do we have data for a sentence pair. Motivated by these applications and inspired by recent progress in variational autoencoders for grouped data, we develop factVAE, a deep generative model capable of handling multi-aspect data, robust to missing observations, and with a prior that encourages disentanglement between the groups and the latent dimensions. The effectiveness of factVAE is demonstrated on a variety of rich real-world datasets, including motion capture poses and pictures of faces captured from varying poses and perspectives.
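To make the missing-data setting concrete, the sketch below shows one common way a multi-group reconstruction term can be made robust to missing observations: log-likelihood terms are summed only over the groups that were actually observed for each sample. This is an illustration under stated assumptions, not the factVAE objective, which the abstract does not specify. The function name `masked_gaussian_log_likelihood`, the fixed-variance Gaussian likelihood, and the boolean `observed` mask are all hypothetical choices made for the example.

```python
import numpy as np

def masked_gaussian_log_likelihood(targets, reconstructions, observed, sigma=1.0):
    """Sum per-group Gaussian log-likelihoods over observed groups only.

    Hypothetical helper for illustration; not taken from the paper.
    targets, reconstructions: lists of arrays of shape (batch, dim_g), one per group.
    observed: boolean array of shape (batch, n_groups); True where a group is present.
    Returns an array of shape (batch,).
    """
    observed = np.asarray(observed, dtype=bool)
    total = np.zeros(observed.shape[0])
    for g, (x, x_hat) in enumerate(zip(targets, reconstructions)):
        present = observed[:, g]
        # Missing rows may hold NaN placeholders; zero them so they cannot propagate.
        x = np.where(present[:, None], np.asarray(x, dtype=float), 0.0)
        x_hat = np.where(present[:, None], np.asarray(x_hat, dtype=float), 0.0)
        dim = x.shape[1]
        sq_err = np.sum((x - x_hat) ** 2, axis=1)
        # Isotropic Gaussian log-density with fixed standard deviation sigma (an assumption).
        log_lik = -0.5 * sq_err / sigma**2 - 0.5 * dim * np.log(2 * np.pi * sigma**2)
        # Unobserved groups contribute nothing to the objective.
        total += np.where(present, log_lik, 0.0)
    return total

# Two groups (e.g. two languages); the second sample lacks its group-1 observation.
targets = [np.array([[0.0, 0.0], [1.0, 1.0]]),
           np.array([[1.0, 2.0, 3.0], [np.nan, np.nan, np.nan]])]
reconstructions = [np.zeros((2, 2)), np.zeros((2, 3))]
observed = np.array([[True, True], [True, False]])
log_lik = masked_gaussian_log_likelihood(targets, reconstructions, observed)
```

In this toy setup the second sample is scored on its single observed group alone, so unpaired data can still contribute to training rather than being discarded.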