CV | AI | Apr 3, 2025

OmniCam: Unified Multimodal Video Generation via Camera Control

arXiv:2504.02312v1 | 7 citations | h-index: 17
Originality: Incremental advance
AI Analysis

This work addresses precise camera motion control for video generation. The advance is incremental: it builds on existing methods but offers improved capabilities.

The paper tackles the complex interaction and limited control capabilities of existing camera control methods for video generation. It introduces OmniCam, a unified multimodal framework that uses large language models and video diffusion models to produce spatio-temporally consistent videos, and it reports state-of-the-art performance across various metrics.

Camera control, which achieves diverse visual effects by changing camera position and pose, has attracted widespread attention. However, existing methods face challenges such as complex interaction and limited control capabilities. To address these issues, we present OmniCam, a unified multimodal camera control framework. Leveraging large language models and video diffusion models, OmniCam generates spatio-temporally consistent videos. It supports various combinations of input modalities: the user can provide text or video with expected trajectory as camera path guidance, and image or video as content reference, enabling precise control over camera motion. To facilitate the training of OmniCam, we introduce the OmniTr dataset, which contains a large collection of high-quality long-sequence trajectories, videos, and corresponding descriptions. Experimental results demonstrate that our model achieves state-of-the-art performance in high-quality camera-controlled video generation across various metrics.

