Hierarchical Video Generation for Complex Data
This work addresses the challenge of video generation for complex data, enabling higher resolutions and longer sequences, though it appears incremental because it builds on established hierarchical and coarse-to-fine concepts.
The paper tackles the problem of generating high-resolution, long-duration videos by proposing a hierarchical model that follows a coarse-to-fine approach: it generates a low-resolution video first and then refines it through subsequent levels. The result is a three-level model that generates 256x256 videos with 48 frames on Kinetics-600 and BDD100K. Because each level is trained sequentially on partial views of the videos, computational complexity is reduced, which lets the model scale to high-resolution videos beyond a few frames.
Videos can often be created by first outlining a global description of the scene and then adding local details. Inspired by this, we propose a hierarchical model for video generation which follows a coarse-to-fine approach. First, our model generates a low-resolution video that establishes the global scene structure, which is then refined by subsequent levels in the hierarchy. We train each level in our hierarchy sequentially on partial views of the videos. This reduces the computational complexity of our generative model, which scales to high-resolution videos beyond a few frames. We validate our approach on Kinetics-600 and BDD100K, for which we train a three-level model capable of generating 256x256 videos with 48 frames.
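
To make the computational argument concrete, the following is a minimal sketch of how output sizes might grow across a three-level hierarchy and how much smaller a partial view is than the full output. Only the final size (48 frames at 256x256) comes from the abstract. The per-level factor of 2 in time and space, the 8-frame temporal window, and the names `LevelShape`, `plan_levels`, and `partial_view` are assumptions made for illustration, not the paper's actual configuration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class LevelShape:
    """Output size of one level in the hierarchy (hypothetical helper)."""
    frames: int
    height: int
    width: int

    def num_pixels(self) -> int:
        return self.frames * self.height * self.width


def plan_levels(final: LevelShape, num_levels: int, scale: int = 2) -> List[LevelShape]:
    """Return shapes from coarsest to finest.

    Assumption: every level multiplies frames, height and width by `scale`.
    The source does not state the per-level factors.
    """
    shapes = []
    for level in range(num_levels):
        factor = scale ** (num_levels - 1 - level)
        shapes.append(
            LevelShape(final.frames // factor, final.height // factor, final.width // factor)
        )
    return shapes


def partial_view(shape: LevelShape, window_frames: int) -> LevelShape:
    """Shape of a temporal window of a level's output.

    Assumption: a "partial view" is modeled here as a short run of
    consecutive frames at full spatial size.
    """
    return LevelShape(min(window_frames, shape.frames), shape.height, shape.width)


# Final output size taken from the abstract: 48 frames at 256x256.
levels = plan_levels(LevelShape(48, 256, 256), num_levels=3)
full = levels[-1]
window = partial_view(full, window_frames=8)

for index, shape in enumerate(levels, start=1):
    print(f"level {index}: {shape.frames} frames at {shape.height}x{shape.width}")
print(f"partial view covers {window.num_pixels() / full.num_pixels():.3f} of the full output")
```

Under these assumptions the coarsest level handles 12 frames at 64x64, and an 8-frame window at the finest level covers one sixth of the pixels in the full 48-frame output, which illustrates why training on partial views is cheaper than training on whole videos.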