Parallel Multiscale Autoregressive Density Estimation
This work addresses the inference bottleneck faced by researchers and practitioners who use autoregressive models for image and video generation, though the contribution is incremental in that it builds directly on PixelCNN.
The paper tackles the slow inference of PixelCNN by proposing a parallelized variant that models certain pixel groups as conditionally independent. It achieves competitive density estimation and an orders-of-magnitude speedup, reducing sampling from O(N) to O(log N) for N pixels and making generation of 512x512 images practical.
PixelCNN achieves state-of-the-art results in density estimation for natural images. Although training is fast, inference is costly, requiring one network evaluation per pixel: O(N) for N pixels. This can be sped up by caching activations, but still involves generating each pixel sequentially. In this work, we propose a parallelized PixelCNN that allows more efficient inference by modeling certain pixel groups as conditionally independent. Our new PixelCNN model achieves competitive density estimation and orders of magnitude speedup (O(log N) sampling instead of O(N)), enabling the practical generation of 512x512 images. We evaluate the model on class-conditional image generation, text-to-image synthesis, and action-conditional video generation, showing that our model achieves the best results among non-pixel-autoregressive density models that allow efficient sampling.
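To make the O(N) versus O(log N) claim concrete, the following is a minimal sketch that only counts sequential sampling steps; it does not implement the model. It assumes a hypothetical coarse-to-fine schedule: a small base image is sampled pixel by pixel, then the resolution is doubled repeatedly, with each doubling filling in the new pixels in a fixed number of parallel groups. The function names and the parameters `base_side` and `groups_per_doubling` are illustrative assumptions, not values taken from the text above.

```python
def sequential_steps_pixelwise(side: int) -> int:
    """One network evaluation per pixel: O(N) steps for N = side * side."""
    return side * side


def sequential_steps_multiscale(
    side: int,
    base_side: int = 4,            # assumed base resolution (hypothetical)
    groups_per_doubling: int = 3,  # assumed parallel groups per 2x upscaling (hypothetical)
) -> int:
    """Count sequential steps for an assumed coarse-to-fine sampling schedule.

    The base image is sampled one pixel at a time. Each doubling of the
    resolution then costs a constant number of steps, because all pixels in
    a group are treated as conditionally independent and sampled at once.
    """
    if side < base_side or side % base_side != 0:
        raise ValueError("side must be a multiple of base_side")
    ratio = side // base_side
    if ratio & (ratio - 1) != 0:
        raise ValueError("side / base_side must be a power of two")
    doublings = ratio.bit_length() - 1  # exact log2 of a power of two
    return base_side * base_side + groups_per_doubling * doublings


if __name__ == "__main__":
    for side in (32, 128, 512):
        print(
            side,
            sequential_steps_pixelwise(side),
            sequential_steps_multiscale(side),
        )
```

Under these assumed settings, a 512x512 image needs 262144 sequential steps pixel by pixel but only 37 in the grouped schedule (16 for the base image plus 3 for each of 7 doublings). The number of doublings grows with log N, which is where the logarithmic sampling cost comes from.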