FlexiFilm: Long Video Generation with Flexible Conditions
This work addresses the challenge of long video generation for applications in media and AI, representing an incremental improvement over existing methods.
The paper tackles the problem of generating long, consistent videos, where existing diffusion-based models suffer from temporal inconsistency and overexposure. It introduces FlexiFilm, a new diffusion model that generates videos over 30 seconds long and outperforms competitors in qualitative and quantitative analyses.
Generating long and consistent videos has emerged as a significant yet challenging problem. While most existing diffusion-based video generation models, derived from image generation models, demonstrate promising performance in generating short videos, their simple conditioning mechanism and sampling strategy, both originally designed for image generation, cause severe performance degradation when adapted to long video generation. This results in prominent temporal inconsistency and overexposure. Thus, in this work, we introduce FlexiFilm, a new diffusion model tailored for long video generation. Our framework incorporates a temporal conditioner to establish a more consistent relationship between generation and multi-modal conditions, and a resampling strategy to tackle overexposure. Empirical results demonstrate that FlexiFilm generates long and consistent videos, each over 30 seconds in length, outperforming competitors in qualitative and quantitative analyses. Project page: https://y-ichen.github.io/FlexiFilm-Page/