CamLit: Unified Video Diffusion with Explicit Camera and Lighting Control
This work addresses the need for simple, integrated control over camera and lighting in video generation, for applications such as virtual reality and content creation. It is a novel integration of the two controls rather than an incremental improvement on either.
CamLit generates videos with controlled camera and lighting from a single image. It introduces a unified diffusion model that jointly performs novel view synthesis and relighting, with output fidelity comparable to state-of-the-art methods in both tasks.
We present CamLit, the first unified video diffusion model that jointly performs novel view synthesis (NVS) and relighting from a single input image. Given one reference image, a user-defined camera trajectory, and an environment map, CamLit synthesizes a video of the scene from new viewpoints under the specified illumination. Within a single generative process, the model produces temporally coherent and spatially aligned outputs, including relit novel-view frames and the corresponding albedo frames, giving explicit control over both camera pose and lighting. Qualitative and quantitative experiments show that CamLit achieves fidelity on par with state-of-the-art methods in both NVS and relighting, without sacrificing visual quality in either task. These results show that a single generative model can integrate camera and lighting control, simplifying the video generation pipeline while remaining competitive and consistently realistic.
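As a rough illustration of what a "user-defined camera trajectory" might look like as data, the sketch below builds one camera pose per output frame on a horizontal arc around the scene. The helper name `orbit_trajectory`, its parameters, and the position-plus-forward-vector pose format are all assumptions made for this example; the abstract does not specify how CamLit encodes camera poses.

```python
import math

def orbit_trajectory(num_frames, radius=2.0, height=0.0, sweep_deg=60.0):
    """Return one camera pose per frame on a horizontal arc facing the origin.

    Hypothetical helper: the parameterization is an assumption for
    illustration only, not CamLit's actual input format.
    """
    poses = []
    for i in range(num_frames):
        # Spread the frames evenly over the sweep, centered on angle 0.
        t = i / (num_frames - 1) if num_frames > 1 else 0.5
        angle = math.radians((t - 0.5) * sweep_deg)
        position = (radius * math.sin(angle), height, radius * math.cos(angle))
        # Unit vector pointing from the camera toward the scene origin.
        norm = math.sqrt(sum(c * c for c in position))
        forward = tuple(-c / norm for c in position)
        poses.append({"position": position, "forward": forward})
    return poses

# Five poses sweeping 60 degrees; the middle pose sits directly in front of the scene.
trajectory = orbit_trajectory(num_frames=5)
```

In a setup like this, the list of poses would be passed to the model alongside the reference image and the environment map, with one pose per generated frame.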