JVID: Joint Video-Image Diffusion for Visual-Quality and Temporal-Consistency in Video Generation
This work addresses the challenge of realistic video generation for applications in media and AI. It is an incremental advance that combines existing diffusion models.
The paper tackles the problem of generating high-quality, temporally coherent videos by introducing the Joint Video-Image Diffusion model (JVID), which integrates an image diffusion model and a video diffusion model to improve visual quality and temporal consistency. The authors report both quantitative and qualitative improvements.
We introduce the Joint Video-Image Diffusion model (JVID), a novel approach to generating high-quality and temporally coherent videos. We achieve this by integrating two diffusion models: a Latent Image Diffusion Model (LIDM) trained on images and a Latent Video Diffusion Model (LVDM) trained on video data. Our method combines these models in the reverse diffusion process, where the LIDM enhances image quality and the LVDM ensures temporal consistency. This combination lets us handle the complex spatio-temporal dynamics of video generation. Our results demonstrate quantitative and qualitative improvements in producing realistic and coherent videos.
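
To make the idea of combining two models in one reverse diffusion process concrete, the following is a minimal toy sketch, not the authors' implementation. The abstract does not say how the LIDM and LVDM steps are interleaved, so the alternating schedule here is an assumption. The two step functions are hypothetical stand-ins for trained networks: the image step touches each frame independently, and the video step pulls frames toward their temporal mean.

```python
import random

def image_denoise_step(frames, t):
    """Hypothetical stand-in for one LIDM reverse step (per-frame only)."""
    return [[0.9 * x for x in frame] for frame in frames]

def video_denoise_step(frames, t):
    """Hypothetical stand-in for one LVDM reverse step (frames seen jointly)."""
    n = len(frames)
    width = len(frames[0])
    # Temporal mean of each latent position across all frames.
    mean = [sum(f[i] for f in frames) / n for i in range(width)]
    # Pull every frame halfway toward the temporal mean.
    return [[0.5 * x + 0.5 * m for x, m in zip(frame, mean)] for frame in frames]

def joint_reverse_process(frames, num_steps, use_video_step):
    """Run the reverse process, choosing one of the two models at each step."""
    for t in reversed(range(num_steps)):
        step = video_denoise_step if use_video_step(t) else image_denoise_step
        frames = step(frames, t)
    return frames

def alternate(t):
    """Assumed schedule: video model on even steps, image model on odd steps."""
    return t % 2 == 0

def temporal_spread(frames):
    """Largest max-minus-min gap across frames at any latent position."""
    width = len(frames[0])
    return max(
        max(f[i] for f in frames) - min(f[i] for f in frames)
        for i in range(width)
    )

rng = random.Random(0)
# Three frames of four latent values each, starting from Gaussian noise.
latents = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(3)]
result = joint_reverse_process(latents, num_steps=10, use_video_step=alternate)
```

In this toy setting the frames end up much closer to each other than the initial noise, which mirrors the role the abstract assigns to the LVDM. A real system would replace both step functions with learned denoisers and a proper noise schedule, and could use a different rule for choosing between the models.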