Consistent Generative Query Networks
This work addresses the inefficiency of video prediction in applications that require flexible frame generation, although the contribution is incremental because it builds on existing generative query networks.
The paper tackles two limitations of stochastic video prediction models: slow autoregressive sampling and the requirement that input and output frames be consecutive. It introduces a model that builds a latent representation from an arbitrary set of frames and uses it to efficiently sample consistent frames at arbitrary time-points. Synthetic video evaluations show substantial speed gains without loss in fidelity.
Stochastic video prediction models take in a sequence of image frames, and generate a sequence of consecutive future image frames. These models typically generate future frames in an autoregressive fashion, which is slow and requires the input and output frames to be consecutive. We introduce a model that overcomes these drawbacks by generating a latent representation from an arbitrary set of frames that can then be used to simultaneously and efficiently sample temporally consistent frames at arbitrary time-points. For example, our model can "jump" and directly sample frames at the end of the video, without sampling intermediate frames. Synthetic video evaluations confirm substantial gains in speed and functionality without loss in fidelity. We also apply our framework to a 3D scene reconstruction dataset. Here, our model is conditioned on camera location and can sample consistent sets of images for what an occluded region of a 3D scene might look like, even if there are multiple possibilities for what that region might contain. Reconstructions and videos are available at https://bit.ly/2O4Pc4R.
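The "jumpy" sampling idea described in the abstract can be illustrated with a toy sketch. This is not the paper's architecture: the functions `encode`, `sample_latent`, `decode`, and `sample_frames` are hypothetical names, frames are short lists of numbers, and the learned networks are replaced by simple arithmetic stand-ins. The sketch assumes an order-invariant (summed) encoding of the context frames and a single latent variable shared by all queried time-points, which is what makes the sampled frames mutually consistent and lets a late frame be sampled without generating the intermediate ones.

```python
import random

def encode(context):
    """Order-invariant summary of (time, frame) pairs.

    Toy stand-in for a learned encoder: sums pixel values and time-stamps,
    so the result does not depend on the order of the context frames.
    """
    dim = len(context[0][1])
    r = [0.0] * (dim + 1)
    for t, frame in context:
        for i, pixel in enumerate(frame):
            r[i] += pixel
        r[dim] += t
    return r

def sample_latent(r, rng):
    """Draw one global latent z centred on r (stand-in for a learned prior)."""
    return [x + rng.gauss(0.0, 1.0) for x in r]

def decode(z, t):
    """Render a toy frame for query time t from the shared latent z.

    Deterministic given (z, t), so no previously generated frame is needed.
    """
    return [0.1 * zi + t for zi in z[:-1]]

def sample_frames(context, query_times, seed=0):
    """Sample z once, then decode every queried time-point from that same z."""
    rng = random.Random(seed)
    z = sample_latent(encode(context), rng)
    return [decode(z, t) for t in query_times]

# Context frames need not be consecutive, and neither do the queries.
context = [(0, [1.0, 2.0]), (7, [3.0, 4.0])]
jump = sample_frames(context, [9])            # go straight to time 9
full = sample_frames(context, [0, 5, 9])      # also sample earlier times
```

In this sketch, an autoregressive model would need one sampling step per intermediate frame to reach time 9, whereas here the cost of a query does not depend on how far it lies from the context. With the same seed, the frame at time 9 is identical whether or not earlier frames are also sampled, because all frames are decoded from the same latent.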