Deep Variational Inference Without Pixel-Wise Reconstruction
This work addresses a known weakness of VAEs that matters to computer vision researchers, offering an incremental improvement by replacing pixel-wise reconstruction with an exactly computed conditional likelihood.
The paper addresses the shortcomings of pixel-wise reconstruction in variational autoencoders (VAEs) by using real-valued non-volume preserving transformations (real NVP) to compute the conditional likelihood exactly, and shows that the resulting simple VAE is competitive with more complicated VAE structures on image modeling tasks.
Variational autoencoders (VAEs) built upon deep neural networks have emerged as popular generative models in computer vision. Most work on improving VAEs has focused on making the approximations to the posterior more flexible and accurate, leading to tremendous progress. However, there have been limited efforts to replace pixel-wise reconstruction, which has known shortcomings. In this work, we use real-valued non-volume preserving transformations (real NVP) to exactly compute the conditional likelihood of the data given the latent distribution. We show that a simple VAE with this form of reconstruction is competitive with complicated VAE structures on image modeling tasks. As part of our model, we develop powerful conditional coupling layers that enable real NVP to learn with fewer intermediate layers.
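To illustrate how a coupling layer can give an exact conditional likelihood, the following is a minimal sketch and not the paper's implementation. It assumes a single affine coupling layer that is conditioned on the latent z by concatenating z with the untransformed half of x, a standard normal base density, and an even-length input. The function names and the linear stand-in for the conditioning networks are hypothetical and chosen only for illustration; the paper's conditional coupling layers may differ.

```python
import math

def scale_and_shift(x_a, z, w_s, w_t):
    # Hypothetical stand-in for the conditioning networks s(x_a, z) and t(x_a, z):
    # one linear map over the concatenation [x_a, z]; tanh bounds the log-scale.
    h = x_a + z  # list concatenation
    s = [math.tanh(sum(w * v for w, v in zip(row, h))) for row in w_s]
    t = [sum(w * v for w, v in zip(row, h)) for row in w_t]
    return s, t

def coupling_forward(x, z, w_s, w_t):
    # Affine coupling: the first half passes through unchanged, the second half
    # is scaled and shifted by functions of the first half and the latent z.
    d = len(x) // 2
    x_a, x_b = x[:d], x[d:]
    s, t = scale_and_shift(x_a, z, w_s, w_t)
    y_b = [xb * math.exp(si) + ti for xb, si, ti in zip(x_b, s, t)]
    # The Jacobian is triangular, so its log-determinant is the sum of log-scales.
    return x_a + y_b, sum(s)

def coupling_inverse(y, z, w_s, w_t):
    # Exact inverse: y_a equals x_a, so s and t can be recomputed from it.
    d = len(y) // 2
    y_a, y_b = y[:d], y[d:]
    s, t = scale_and_shift(y_a, z, w_s, w_t)
    x_b = [(yb - ti) * math.exp(-si) for yb, si, ti in zip(y_b, s, t)]
    return y_a + x_b

def standard_normal_logpdf(u):
    return sum(-0.5 * (v * v + math.log(2 * math.pi)) for v in u)

def conditional_log_likelihood(x, z, w_s, w_t):
    # Change of variables: log p(x | z) = log p_base(f(x; z)) + log |det df/dx|.
    u, log_det = coupling_forward(x, z, w_s, w_t)
    return standard_normal_logpdf(u) + log_det
```

In a VAE, a term of this kind would take the place of the pixel-wise reconstruction term in the evidence lower bound, with z sampled from the approximate posterior. A practical model would stack several such layers, alternate which half is transformed, and use neural networks for the scale and shift functions.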