CV · May 22, 2024

Text Prompting for Multi-Concept Video Customization by Autoregressive Generation

arXiv:2405.13951v1 · 3 citations · h-index: 23 · Has Code
Originality: Incremental advance
AI Analysis

This paper addresses multi-concept video customization for creative and media applications; it is an incremental advance over existing single-concept methods.

The paper tackles the problem of generating videos that combine multiple custom concepts (subjects, actions, backgrounds) using pretrained text-to-video models. Example outputs include a teddy bear running towards a brown teapot. Results are evaluated with videoCLIP and DINO scores and with human assessment.

We present a method for multi-concept customization of pretrained text-to-video (T2V) models. Intuitively, the multi-concept customized video can be derived from the (non-linear) intersection of the video manifolds of the individual concepts, which is not straightforward to find. We hypothesize that sequential and controlled walking towards the intersection of the video manifolds, directed by text prompting, leads to the solution. To do so, we generate the various concepts and their corresponding interactions, sequentially, in an autoregressive manner. Our method can generate videos of multiple custom concepts (subjects, action and background) such as a teddy bear running towards a brown teapot, a dog playing violin and a teddy bear swimming in the ocean. We quantitatively evaluate our method using videoCLIP and DINO scores, in addition to human evaluation. Videos for results presented in this paper can be found at https://github.com/divyakraman/MultiConceptVideo2024.

