VividCam: Learning Unconventional Camera Motions from Virtual Synthetic Videos
This work addresses a bottleneck in creating artistic videos for content creators, but the contribution is incremental: it builds on existing diffusion models, adding new training data and disentanglement strategies.
The paper tackles the problem that text-to-video models struggle to generalize to unconventional camera motions because of insufficient training data. It proposes VividCam, a training paradigm that uses synthetic videos to teach diffusion models complex camera motions, enabling precise control and the synthesis of a wide range of motions.
Although recent text-to-video generative models are becoming more capable of following external camera controls, imposed by either text descriptions or camera trajectories, they still struggle to generalize to unconventional camera motions, which are crucial for creating truly original and artistic videos. The challenge lies in the difficulty of finding sufficient training videos with the intended uncommon camera motions. To address this challenge, we propose VividCam, a training paradigm that enables diffusion models to learn complex camera motions from synthetic videos, removing the reliance on collecting realistic training videos. VividCam incorporates multiple disentanglement strategies that isolate camera motion learning from synthetic appearance artifacts, ensuring a more robust motion representation and mitigating domain shift. We demonstrate that our design synthesizes a wide range of precisely controlled and complex camera motions using surprisingly simple synthetic data. Notably, this synthetic data often consists of basic geometries within a low-poly 3D scene and can be efficiently rendered by engines like Unity. Our video results can be found at https://wuqiuche.github.io/VividCamDemoPage/ .