CV | Dec 7, 2025

Scaling Zero-Shot Reference-to-Video Generation

Zijian Zhou, Shikun Liu, Haozhe Liu, Haonan Qiu, Zhaochong An, Weiming Ren, Zhiheng Liu, Xiaoke Huang, Kam Woh Ng, Tian Xie, Xiao Han, Yuren Cong

arXiv:2512.06905v1 | 8 citations | h-index: 11

Originality: Highly original

AI Analysis

The paper addresses the data-scalability bottleneck in reference-to-video generation and is assessed as a novel method rather than an incremental improvement.

Reference-to-video methods depend on explicit reference image-video-text triplets, which are expensive to build and hard to scale. The paper introduces Saber, a scalable zero-shot framework that requires no such data, and reports better performance on the OpenS2V-Eval benchmark than methods trained with R2V data.

Reference-to-video (R2V) generation aims to synthesize videos that align with a text prompt while preserving the subject identity from reference images. However, current R2V methods are hindered by the reliance on explicit reference image-video-text triplets, whose construction is highly expensive and difficult to scale. We bypass this bottleneck by introducing Saber, a scalable zero-shot framework that requires no explicit R2V data. Trained exclusively on video-text pairs, Saber employs a masked training strategy and a tailored attention-based model design to learn identity-consistent and reference-aware representations. Mask augmentation techniques are further integrated to mitigate copy-paste artifacts common in reference-to-video generation. Moreover, Saber demonstrates remarkable generalization capabilities across a varying number of references and achieves superior performance on the OpenS2V-Eval benchmark compared to methods trained with R2V data.
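
The abstract does not spell out how the masked training strategy works, so the following is only a rough sketch of one way pseudo-references could be derived from plain video-text pairs. It assumes a video stored as a NumPy array of shape (T, H, W, C), a single random rectangular mask, and a horizontal flip standing in for mask augmentation. The function names `random_box_mask` and `make_pseudo_reference`, the mask shape, and the augmentation choice are all hypothetical and are not taken from the paper.

```python
import numpy as np


def random_box_mask(height, width, rng, min_frac=0.3, max_frac=0.7):
    """Return a boolean (height, width) mask with one random rectangle set to True.

    The rectangle shape and the size fractions are illustrative assumptions.
    """
    box_h = int(height * rng.uniform(min_frac, max_frac))
    box_w = int(width * rng.uniform(min_frac, max_frac))
    top = rng.integers(0, height - box_h + 1)
    left = rng.integers(0, width - box_w + 1)
    mask = np.zeros((height, width), dtype=bool)
    mask[top:top + box_h, left:left + box_w] = True
    return mask


def make_pseudo_reference(video, rng):
    """Build a (reference, mask) pair from a video of shape (T, H, W, C).

    A random frame is masked so that only one region survives; that region
    plays the role of a reference image during training (hypothetical setup).
    """
    num_frames, height, width, _ = video.shape
    frame = video[rng.integers(0, num_frames)]
    mask = random_box_mask(height, width, rng)

    # Keep pixels inside the mask and zero out everything else.
    reference = np.where(mask[..., None], frame, 0).astype(video.dtype)

    # Assumed augmentation: flip the reference and its mask horizontally half
    # of the time, so the reference does not always line up pixel-for-pixel
    # with the target frame (one plausible way to discourage copy-paste).
    if rng.random() < 0.5:
        reference = reference[:, ::-1]
        mask = mask[:, ::-1]
    return reference, mask


rng = np.random.default_rng(0)
video = rng.integers(1, 256, size=(4, 32, 48, 3), dtype=np.uint8)
reference, mask = make_pseudo_reference(video, rng)
```

In a training loop built on this idea, the masked reference would be passed to the model as conditioning while the full video and its caption remain the generation target, which is why no separately collected reference images are needed.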

