CV · Jan 7, 2025

Magic Mirror: ID-Preserved Video Generation in Video Diffusion Transformers

arXiv:2501.03931v1 · 31 citations · h-index: 15 · Has Code
Originality: Incremental advance
AI Analysis

This work addresses the challenge of maintaining identity consistency in text-to-video generation, with applications such as personalized content creation. It is an incremental advance because it builds on existing Video Diffusion Transformers.

The paper tackles identity-preserved video generation in video diffusion models, aiming for consistent identity together with natural motion. It reports improved performance across multiple metrics with minimal added parameters.

We present Magic Mirror, a framework for generating identity-preserved videos with cinematic-level quality and dynamic motion. While recent advances in video diffusion models have shown impressive capabilities in text-to-video generation, maintaining consistent identity while producing natural motion remains challenging. Previous methods either require person-specific fine-tuning or struggle to balance identity preservation with motion diversity. Built upon Video Diffusion Transformers, our method introduces three key components: (1) a dual-branch facial feature extractor that captures both identity and structural features, (2) a lightweight cross-modal adapter with Conditioned Adaptive Normalization for efficient identity integration, and (3) a two-stage training strategy combining synthetic identity pairs with video data. Extensive experiments demonstrate that Magic Mirror effectively balances identity consistency with natural motion, outperforming existing methods across multiple metrics while adding minimal parameters. The code and model will be made publicly available at: https://github.com/dvlab-research/MagicMirror/
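The abstract names Conditioned Adaptive Normalization but does not define it. As a rough intuition only, the sketch below shows generic adaptive normalization: a feature vector is layer-normalized, then rescaled and shifted by values predicted from a conditioning vector. This is an assumption about the general family of techniques, not the paper's actual layer. The names `adaptive_norm`, `cond`, `w_scale`, and `w_shift` are hypothetical, and `cond` merely stands in for an identity embedding.

```python
import math
from typing import List, Sequence


def layer_norm(x: Sequence[float], eps: float = 1e-5) -> List[float]:
    """Normalize a feature vector to zero mean and (roughly) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear(w: Sequence[Sequence[float]], b: Sequence[float],
           v: Sequence[float]) -> List[float]:
    """Compute w @ v + b for a small dense projection."""
    return [sum(wi * vi for wi, vi in zip(row, v)) + bi
            for row, bi in zip(w, b)]


def adaptive_norm(x, cond, w_scale, b_scale, w_shift, b_shift) -> List[float]:
    """Layer-normalize x, then modulate it with a scale and shift predicted from cond.

    Hypothetical illustration: cond plays the role of an identity embedding.
    """
    scale = linear(w_scale, b_scale, cond)
    shift = linear(w_shift, b_shift, cond)
    normed = layer_norm(x)
    return [n * (1.0 + s) + t for n, s, t in zip(normed, scale, shift)]


# Toy inputs: a 4-dimensional feature and a 2-dimensional condition.
x = [1.0, 2.0, 3.0, 4.0]
cond = [0.5, -0.5]

# With zero-initialized projections the layer reduces to plain layer norm,
# so the condition has no effect until the projections are trained.
zeros_w = [[0.0, 0.0] for _ in x]
zeros_b = [0.0 for _ in x]
out = adaptive_norm(x, cond, zeros_w, zeros_b, zeros_w, zeros_b)
```

Zero-initializing the scale and shift projections is a common design choice for adapters of this kind, since the base model's behavior is unchanged at the start of training. Whether Magic Mirror does this is not stated in the abstract.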

Code Implementations: 1 repo
