CV · Jul 16, 2024

Contrastive Sequential-Diffusion Learning: Non-linear and Multi-Scene Instructional Video Synthesis

Vasco Ramos, Yonatan Bitton, Michal Yarom, Idan Szpektor, Joao Magalhaes

arXiv:2407.11814v37.64 citations · h-index: 37 · Has Code

Originality: Incremental advance

AI Analysis

This paper addresses the challenge of creating coherent instructional videos for applications such as recipes and DIY projects, though the contribution appears incremental because it builds on existing diffusion methods.

The paper tackles the problem of generating multi-scene instructional videos with non-linear visual consistency, where a scene must align with an earlier scene rather than only its immediate predecessor, and reports improved consistency in experiments with real-world action-centered data.

Generated video scenes for action-centric sequence descriptions, such as recipe instructions and do-it-yourself projects, often include non-linear patterns, where the next video may need to be visually consistent not with the immediately preceding video but with earlier ones. Current multi-scene video synthesis approaches fail to meet these consistency requirements. To address this, we propose a contrastive sequential video diffusion method that selects the most suitable previously generated scene to guide and condition the denoising process of the next scene. The result is a multi-scene video that is grounded in the scene descriptions and coherent w.r.t. the scenes that require visual consistency. Experiments with action-centered data from the real world demonstrate the practicality and improved consistency of our model compared to previous work.
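
The selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it swaps their contrastive selection for plain cosine similarity, and the function names, the `min_similarity` threshold, and the toy two-dimensional embeddings are all hypothetical. It only shows the idea of choosing which earlier scene (not necessarily the last one) should condition the next one.

```python
import math
from typing import Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_reference_scene(
    step_embedding: Sequence[float],
    previous_scene_embeddings: Sequence[Sequence[float]],
    min_similarity: float = 0.0,  # hypothetical threshold, not from the paper
) -> Optional[int]:
    """Return the index of the earlier scene most similar to the next step.

    Returns None when no earlier scene scores above `min_similarity`,
    in which case the next scene would be generated without a reference.
    """
    best_index: Optional[int] = None
    best_score = min_similarity
    for index, scene_embedding in enumerate(previous_scene_embeddings):
        score = cosine_similarity(step_embedding, scene_embedding)
        if score > best_score:
            best_index, best_score = index, score
    return best_index


# Toy example: the next step resembles scene 0, not the most recent scene 2.
previous_scenes = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
next_step = [0.9, 0.1]
reference = select_reference_scene(next_step, previous_scenes)
print(reference)
```

In this toy case the chosen reference is the first scene rather than the immediately preceding one, which is the non-linear pattern the abstract refers to. In a real pipeline the selected scene would then be passed as conditioning to the denoising process for the next scene.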
