CV · Mar 26

RefAlign: Representation Alignment for Reference-to-Video Generation

arXiv:2603.25743 · 40.53 citations · h-index: 21
Predicted impact: top 8% in CV · last 90 days · Originality: Incremental advance
AI Analysis

This work addresses challenges in controllable video synthesis for applications like personalized advertising and virtual try-on, representing an incremental improvement over existing methods.

The paper tackles copy-paste artifacts and multi-subject confusion in reference-to-video generation. It proposes RefAlign, a representation alignment framework that aligns DiT reference-branch features to the semantic space of a visual foundation model, and reports a higher TotalScore than current state-of-the-art methods on the OpenS2V-Eval benchmark.

Reference-to-video (R2V) generation is a controllable video synthesis paradigm that constrains the generation process using both text prompts and reference images, enabling applications such as personalized advertising and virtual try-on. In practice, existing R2V methods typically introduce additional high-level semantic or cross-modal features alongside the VAE latent representation of the reference image and jointly feed them into the diffusion Transformer (DiT). These auxiliary representations provide semantic guidance and act as implicit alignment signals, which can partially alleviate pixel-level information leakage in the VAE latent space. However, they may still struggle to address copy-paste artifacts and multi-subject confusion caused by modality mismatch across heterogeneous encoder features. In this paper, we propose RefAlign, a representation alignment framework that explicitly aligns DiT reference-branch features to the semantic space of a visual foundation model (VFM). The core of RefAlign is a reference alignment loss that pulls the reference features and VFM features of the same subject closer to improve identity consistency, while pushing apart the corresponding features of different subjects to enhance semantic discriminability. This simple yet effective strategy is applied only during training, incurring no inference-time overhead, and achieves a better balance between text controllability and reference fidelity. Extensive experiments on the OpenS2V-Eval benchmark demonstrate that RefAlign outperforms current state-of-the-art methods in TotalScore, validating the effectiveness of explicit reference alignment for R2V tasks.
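The abstract describes the reference alignment loss only in words: pull same-subject features together, push different-subject features apart. The sketch below shows one common way to get that behavior, an InfoNCE-style contrastive objective over cosine similarities. This form is an assumption for illustration, not the paper's stated formula. The function names, the `temperature` parameter, and the use of one pooled feature vector per subject are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def reference_alignment_loss(ref_feats, vfm_feats, temperature=0.1):
    """InfoNCE-style alignment loss (assumed form, not taken from the paper).

    ref_feats[i]: pooled DiT reference-branch feature for subject i.
    vfm_feats[i]: pooled visual-foundation-model feature for subject i.
    The matching pair (i, i) is the positive; every pair (i, j != i)
    acts as a negative, so distinct subjects are pushed apart.
    """
    n = len(ref_feats)
    total = 0.0
    for i in range(n):
        logits = [cosine(ref_feats[i], vfm_feats[j]) / temperature
                  for j in range(n)]
        # Numerically stable log-sum-exp over all candidate subjects.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        # Cross-entropy with the same-subject pair as the target.
        total += log_denom - logits[i]
    return total / n


# Toy example with two subjects and 2-D features (made-up numbers).
ref = [[1.0, 0.0], [0.0, 1.0]]
vfm_matched = [[0.9, 0.1], [0.1, 0.9]]   # each subject lines up with its VFM feature
vfm_confused = [[0.1, 0.9], [0.9, 0.1]]  # subjects swapped, i.e. identity confusion

print(reference_alignment_loss(ref, vfm_matched))
print(reference_alignment_loss(ref, vfm_confused))
```

In this toy setup the matched pairing yields a lower loss than the swapped one, which is the property the abstract attributes to the alignment term. Because such a loss would be added only to the training objective, it is consistent with the stated absence of inference-time overhead.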

