Ctrl-V: Higher Fidelity Video Generation with Bounding-Box Controlled Object Motion
This work addresses the need for accurate object motion prediction in domains like autonomous driving and represents an incremental improvement in controllable video generation.
The paper tackles precise control over object motion in video synthesis by using bounding boxes to guide movements and a specialized model for trajectory forecasting, achieving realistic and controllable video generation validated on the KITTI, Virtual-KITTI 2, BDD100k, and nuScenes datasets.
Controllable video generation has attracted significant attention, largely due to advances in video diffusion models. In domains such as autonomous driving, it is essential to predict object motion with high accuracy. This paper tackles the crucial challenge of exerting precise control over object motion for realistic video synthesis. To accomplish this, we 1) control object movements using bounding boxes and extend this control to the renderings of 2D or 3D boxes in pixel space, 2) employ a distinct, specialized model to forecast the trajectories of object bounding boxes based on their previous and, if desired, future positions, and 3) adapt and enhance a separate video diffusion network to create video content based on these high-quality trajectory forecasts. Our method, Ctrl-V, leverages modified and fine-tuned Stable Video Diffusion (SVD) models for both trajectory generation and video generation. Extensive experiments conducted on the KITTI, Virtual-KITTI 2, BDD100k, and nuScenes datasets validate the effectiveness of our approach in producing realistic and controllable videos.
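To make the idea of rendering bounding boxes in pixel space concrete, the following is a minimal sketch in plain Python. It draws 2D box outlines onto blank frames so that a box sequence becomes an image-like conditioning signal. Everything here is an illustrative assumption rather than the Ctrl-V implementation: the names `render_boxes`, `interpolate_box`, and `render_trajectory` are hypothetical, the frames are simple binary grids, and linear interpolation between a first and last box is only a trivial stand-in for the learned trajectory model described above.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max), inclusive pixel coords
Frame = List[List[int]]          # height x width grid: 0 = background, 1 = box outline


def render_boxes(boxes: List[Box], height: int, width: int) -> Frame:
    """Draw 1-pixel-wide outlines of 2D boxes on a blank frame, clipped to the image."""
    frame = [[0] * width for _ in range(height)]
    for x0, y0, x1, y1 in boxes:
        # Clip the box extent to the image; skip boxes that lie fully outside.
        cx0, cy0 = max(x0, 0), max(y0, 0)
        cx1, cy1 = min(x1, width - 1), min(y1, height - 1)
        if cx0 > cx1 or cy0 > cy1:
            continue
        # Horizontal edges (drawn only if the edge row is inside the image).
        for x in range(cx0, cx1 + 1):
            if 0 <= y0 < height:
                frame[y0][x] = 1
            if 0 <= y1 < height:
                frame[y1][x] = 1
        # Vertical edges (drawn only if the edge column is inside the image).
        for y in range(cy0, cy1 + 1):
            if 0 <= x0 < width:
                frame[y][x0] = 1
            if 0 <= x1 < width:
                frame[y][x1] = 1
    return frame


def interpolate_box(start: Box, end: Box, t: float) -> Box:
    """Linearly blend two boxes for t in [0, 1].

    Assumption: a placeholder for a learned trajectory predictor, not the
    paper's method.
    """
    x0, y0, x1, y1 = (round(s + t * (e - s)) for s, e in zip(start, end))
    return (x0, y0, x1, y1)


def render_trajectory(
    start: Box, end: Box, num_frames: int, height: int, width: int
) -> List[Frame]:
    """Render one object's box track as a list of pixel-space frames."""
    frames = []
    for i in range(num_frames):
        t = i / (num_frames - 1) if num_frames > 1 else 0.0
        frames.append(render_boxes([interpolate_box(start, end, t)], height, width))
    return frames


if __name__ == "__main__":
    # A 3x3 box moving diagonally across an 8x8 frame over three time steps.
    for frame in render_trajectory((0, 0, 2, 2), (4, 4, 6, 6), 3, 8, 8):
        print("\n".join("".join("#" if v else "." for v in row) for row in frame))
        print()
```

Drawing only the outline leaves the box interior empty, so the rendered frame encodes where an object is and how large it is without prescribing its appearance. A real pipeline would likely render colored boxes per object or class at the video resolution; that detail is not specified here and is left out of the sketch.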