CV AI LGMay 7, 2024

TALC: Time-Aligned Captions for Multi-Scene Text-to-Video Generation

Hritik Bansal, Yonatan Bitton, Michal Yarom, Idan Szpektor, Aditya Grover, Kai-Wei Chang

arXiv:2405.04682v419.827 citationsh-index: 37Has Code

Originality Incremental advance

AI Analysis

This addresses a limitation in text-to-video generation for applications requiring coherent multi-scene narratives, though it is incremental as it builds on pretrained models.

The paper tackles the problem of generating multi-scene videos from text descriptions, which existing text-to-video models struggle with, by introducing the TALC framework that enhances text-conditioning to align scenes temporally, resulting in a 29% relative gain in overall score compared to baselines.

Most of these text-to-video (T2V) generative models often produce single-scene video clips that depict an entity performing a particular action (e.g., 'a red panda climbing a tree'). However, it is pertinent to generate multi-scene videos since they are ubiquitous in the real-world (e.g., 'a red panda climbing a tree' followed by 'the red panda sleeps on the top of the tree'). To generate multi-scene videos from the pretrained T2V model, we introduce a simple and effective Time-Aligned Captions (TALC) framework. Specifically, we enhance the text-conditioning mechanism in the T2V architecture to recognize the temporal alignment between the video scenes and scene descriptions. For instance, we condition the visual features of the earlier and later scenes of the generated video with the representations of the first scene description (e.g., 'a red panda climbing a tree') and second scene description (e.g., 'the red panda sleeps on the top of the tree'), respectively. As a result, we show that the T2V model can generate multi-scene videos that adhere to the multi-scene text descriptions and be visually consistent (e.g., entity and background). Further, we finetune the pretrained T2V model with multi-scene video-text data using the TALC framework. We show that the TALC-finetuned model outperforms the baseline by achieving a relative gain of 29% in the overall score, which averages visual consistency and text adherence using human evaluation.

View on arXiv PDF Code

Similar