CV | Dec 20, 2024

CustomTTT: Motion and Appearance Customized Video Generation via Test-Time Training

arXiv:2412.15646v2 | 12 citations | h-index: 36
Originality: Incremental advance
AI Analysis

This work addresses the generation of high-quality videos with combined customizations for video creation applications. It is incremental because it builds on existing fine-tuning methods.

The paper tackles the problem of combining multiple customized concepts (appearance and motion) from different references in text-to-video generation, where the combination causes artifacts. It proposes CustomTTT to address this and reports outperforming state-of-the-art methods in its evaluations.

Benefiting from large-scale pre-training on text-video pairs, current text-to-video (T2V) diffusion models can generate high-quality videos from a text description. In addition, given some reference images or videos, the parameter-efficient fine-tuning method, i.e., LoRA, can generate high-quality customized concepts, e.g., a specific subject or the motion from a reference video. However, combining multiple concepts trained on different references into a single network produces obvious artifacts. To this end, we propose CustomTTT, which makes it easy to jointly customize the appearance and the motion of the given video. In detail, we first analyze the influence of the prompt in current video diffusion models and find that LoRAs are needed only in specific layers for appearance and motion customization. Moreover, since each LoRA is trained individually, we propose a novel test-time training technique that updates the parameters after combination, utilizing the individually trained customized models. We conduct detailed experiments to verify the effectiveness of the proposed method. Our method outperforms several state-of-the-art works in both qualitative and quantitative evaluations.
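As a rough illustration of the layer-selective idea in the abstract, the toy sketch below merges two separately trained LoRA adapters into one set of weights, with each adapter touching only the layers it names. It is not the authors' implementation: the layer names `spatial_block` and `temporal_block`, the assignment of appearance and motion to those layers, the helper names, and all numeric values are assumptions made for the example. It uses the standard LoRA update W + B·A and does not cover the test-time training step, whose details are not given here.

```python
# Toy, pure-Python sketch of merging two LoRA adapters into one set of weights.
# Layer names, shapes, and values are hypothetical, not taken from the paper.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def apply_lora(weight, lora_b, lora_a, scale=1.0):
    """Return weight + scale * (B @ A), the standard LoRA low-rank update."""
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]

def merge_adapters(base, adapters):
    """Apply each adapter only to the layers it names; other layers stay unchanged."""
    merged = {name: [row[:] for row in w] for name, w in base.items()}
    for adapter in adapters:
        for name, (lora_b, lora_a) in adapter.items():
            merged[name] = apply_lora(merged[name], lora_b, lora_a)
    return merged

# Two 2x2 base layers (identity matrices for readability).
base = {
    "spatial_block": [[1.0, 0.0], [0.0, 1.0]],
    "temporal_block": [[1.0, 0.0], [0.0, 1.0]],
}

# Rank-1 adapters: B is 2x1, A is 1x2. Each adapter names a single layer.
appearance_lora = {"spatial_block": ([[1.0], [0.0]], [[0.5, 0.5]])}
motion_lora = {"temporal_block": ([[0.0], [1.0]], [[0.25, 0.75]])}

merged = merge_adapters(base, [appearance_lora, motion_lora])
print(merged["spatial_block"])
print(merged["temporal_block"])
```

Because the two adapters name disjoint layers in this toy setup, their updates cannot overwrite each other. The abstract's point is that adapters trained individually may still interact badly once combined, which is what the proposed test-time training is meant to correct.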

Code Implementations: 1 repo
