OmniWeaving: Towards Unified Video Generation with Free-form Composition and Reasoning
OmniWeaving addresses the need for advanced, open-source video generation tools for researchers and developers, though the contribution is incremental: it aims to match capabilities that proprietary systems already demonstrate.
The paper tackles the problem of fragmented and limited open-source video generation models by proposing OmniWeaving, a unified model that integrates diverse tasks with multimodal composition and reasoning, achieving state-of-the-art performance among open-source unified models.
While proprietary systems such as Seedance-2.0 have achieved remarkable success in omni-capable video generation, open-source alternatives significantly lag behind. Most academic models remain heavily fragmented, and the few existing efforts toward unified video generation still struggle to seamlessly integrate diverse tasks within a single framework. To bridge this gap, we propose OmniWeaving, an omni-level video generation model featuring powerful multimodal composition and reasoning-informed capabilities. By leveraging a massive-scale pretraining dataset that encompasses diverse compositional and reasoning-augmented scenarios, OmniWeaving learns to temporally bind interleaved text, multi-image, and video inputs while acting as an intelligent agent to infer complex user intentions for sophisticated video creation. Furthermore, we introduce IntelligentVBench, the first comprehensive benchmark designed to rigorously assess next-level intelligent unified video generation. Extensive experiments demonstrate that OmniWeaving achieves state-of-the-art (SoTA) performance among open-source unified models. The code and model will be made publicly available soon. Project Page: https://omniweaving.github.io.