Layer-Aware Video Composition via Split-then-Merge
This work addresses data scarcity and limited control in video generation, with applications such as content creation. The contribution appears incremental, since it builds on existing generative methods.
The paper tackles generative video composition with the Split-then-Merge framework, which splits unlabeled videos into layers and self-composes them to learn compositional dynamics. The authors report that it outperforms state-of-the-art methods in both quantitative and qualitative evaluations.
We present Split-then-Merge (StM), a novel framework designed to enhance control in generative video composition and address its data scarcity problem. Unlike conventional methods that rely on annotated datasets or handcrafted rules, StM splits a large corpus of unlabeled videos into dynamic foreground and background layers, then self-composes them to learn how dynamic subjects interact with diverse scenes. This process enables the model to learn the complex compositional dynamics required for realistic video generation. StM introduces a novel transformation-aware training pipeline that uses multi-layer fusion and augmentation to achieve affordance-aware composition, alongside an identity-preservation loss that maintains foreground fidelity during blending. Experiments show that StM outperforms SoTA methods in both quantitative benchmarks and human and VLLM-based qualitative evaluations. More details are available on our project page: https://split-then-merge.github.io
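The split and merge steps described above can be illustrated with a small NumPy sketch. It is not the authors' implementation: StM trains a generative model, whereas this only shows how a training pair might be assembled from a video and a foreground mask. All function names (`split_layers`, `shift_right`, `merge_layers`, `identity_preservation_loss`) are hypothetical, the horizontal shift stands in for the unspecified augmentations, and the masked mean-squared error is an assumed form of the identity-preservation loss, which the abstract does not define.

```python
import numpy as np


def split_layers(video, mask):
    """Split a video (T, H, W, C) into foreground and background layers.

    `mask` is a binary array of shape (T, H, W); 1 marks foreground pixels.
    """
    m = mask[..., None].astype(video.dtype)
    return video * m, video * (1.0 - m)


def shift_right(arr, dx):
    """Assumed augmentation: shift along the width axis by dx >= 0, zero-filled."""
    out = np.zeros_like(arr)
    if dx == 0:
        out[...] = arr
    else:
        out[:, :, dx:] = arr[:, :, :-dx]
    return out


def merge_layers(foreground, fg_mask, background):
    """Alpha-blend a foreground layer onto a background video using its mask."""
    m = fg_mask[..., None].astype(background.dtype)
    return m * foreground + (1.0 - m) * background


def identity_preservation_loss(pred, target, fg_mask):
    """Assumed loss form: mean squared error restricted to foreground pixels."""
    m = fg_mask[..., None].astype(pred.dtype)
    n_values = max(float(m.sum()) * pred.shape[-1], 1.0)
    return float((((pred - target) ** 2) * m).sum() / n_values)
```

A pair would then be built by calling `split_layers` on one clip, applying `shift_right` to both the foreground layer and its mask, and passing the result to `merge_layers` together with a second clip as the background. Restricting the loss to the mask means background pixels do not influence the foreground fidelity term.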