CV · AI · May 4, 2025

DualReal: Adaptive Joint Training for Lossless Identity-Motion Fusion in Video Customization

arXiv:2505.02192v2 · 8 citations · h-index: 9
Originality: Incremental advance
AI Analysis

The paper addresses the challenge of generating videos that stay consistent with both a user-specified subject and a user-specified motion. The advance appears incremental, since it builds on existing customization paradigms.

The paper tackles the problem of identity-motion conflicts in customized text-to-video generation by introducing DualReal, a framework that uses adaptive joint training to fuse identity and motion patterns, resulting in improvements of 21.7% in CLIP-I and 31.8% in DINO-I metrics on average.
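As a rough illustration of what identity metrics such as CLIP-I and DINO-I measure, the sketch below averages the cosine similarity between embeddings of generated frames and embeddings of reference images. It assumes the embeddings have already been extracted by an image encoder (CLIP or DINO); the function names `cosine_similarity` and `identity_score` are hypothetical, and the paper's exact evaluation protocol may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def identity_score(frame_embeddings, reference_embeddings):
    """Mean pairwise cosine similarity between generated frames and references.

    Assumption: embeddings are precomputed by an image encoder; this helper
    only performs the averaging step.
    """
    sims = [
        cosine_similarity(frame, ref)
        for frame in frame_embeddings
        for ref in reference_embeddings
    ]
    return sum(sims) / len(sims)


if __name__ == "__main__":
    # Toy 2-D embeddings: one frame matches the reference, one is orthogonal.
    frames = [[1.0, 0.0], [0.0, 1.0]]
    references = [[1.0, 0.0]]
    print(identity_score(frames, references))
```

A higher score means the generated frames sit closer to the reference subject in the encoder's embedding space.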

Customized text-to-video generation with pre-trained large-scale models has recently garnered significant attention, with a focus on identity and motion consistency. Existing works typically follow an isolated customization paradigm, in which either the subject identity or the motion dynamics is customized exclusively. However, this paradigm ignores the intrinsic mutual constraints and synergistic interdependencies between identity and motion, resulting in identity-motion conflicts throughout the generation process that systematically degrade the results. To address this, we introduce DualReal, a novel framework that employs adaptive joint training to collaboratively construct interdependencies between the two dimensions. Specifically, DualReal is composed of two units: (1) Dual-aware Adaptation dynamically switches the training step (i.e., identity or motion), learns the current information guided by the frozen dimension prior, and employs a regularization strategy to avoid knowledge leakage; (2) StageBlender Controller leverages the denoising stages and Diffusion Transformer depths to guide different dimensions with adaptive granularity, avoiding conflicts at various stages and ultimately achieving lossless fusion of identity and motion patterns. We constructed a more comprehensive evaluation benchmark than those used by existing methods. The experimental results show that DualReal improves CLIP-I and DINO-I metrics by 21.7% and 31.8% on average, and achieves top performance on nearly all motion metrics. Project page: https://wenc-k.github.io/dualreal-customization
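The alternating idea behind Dual-aware Adaptation (train one dimension while the other stays frozen, then switch) can be sketched with a toy example. This is not the authors' implementation: the fixed switching interval, the scalar "adapters", the quadratic objective, and the names `active_dimension` and `train` are all assumptions made for illustration. The abstract describes the switching as dynamic and adds a regularization strategy and the StageBlender Controller, none of which are modeled here.

```python
def active_dimension(step, switch_every=2):
    """Return which adapter is trainable at this step.

    Assumption: a fixed alternation every `switch_every` steps; the paper
    describes the switching as dynamic.
    """
    return "identity" if (step // switch_every) % 2 == 0 else "motion"


def train(num_steps=200, lr=0.1, switch_every=2):
    # Toy scalar "adapters"; real adapters would be network modules.
    weights = {"identity": 0.0, "motion": 0.0}
    # Toy joint objective: both adapters must cooperate to reach the target.
    target = 3.0
    history = []
    for step in range(num_steps):
        active = active_dimension(step, switch_every)
        prediction = weights["identity"] + weights["motion"]
        error = prediction - target
        # The gradient of 0.5 * error**2 w.r.t. the active weight is `error`.
        # The frozen weight is not updated; it only conditions the prediction.
        weights[active] -= lr * error
        history.append((active, 0.5 * error ** 2))
    return weights, history


if __name__ == "__main__":
    final_weights, loss_history = train()
    print(final_weights)
    print(loss_history[-1])
```

Because the frozen adapter still contributes to the prediction, each update to the active adapter is made in the context of what the other has already learned, which is the point of training the two jointly rather than in isolation.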

Foundations

The foundational work for this paper's niche, ranked by how specifically the neighbourhood builds on it — not by global fame.
