CV · Apr 4

Rethinking Position Embedding as a Context Controller for Multi-Reference and Multi-Shot Video Generation

arXiv:2604.03738
Predicted impact: top 26% in CV (last 90 days) · Originality: Incremental advance
AI Analysis

For researchers in video generation, this work addresses the specific problem of reference confusion when reference images are highly similar, offering a practical solution for multi-character consistency.

PoCo introduces position encoding as a context controller to resolve reference confusion in multi-shot video generation with multiple reference characters, achieving improved cross-shot consistency and reference fidelity over baselines.

Recent proprietary models such as Sora2 demonstrate promising progress in generating multi-shot videos conditioned on multiple reference characters. However, academic research on this problem remains limited. We study this task and identify a core challenge: when reference images exhibit highly similar appearances, the model often suffers from reference confusion, where semantically similar tokens degrade the model's ability to retrieve the correct context. To address this, we introduce PoCo (Position Embedding as a Context Controller), which incorporates position encoding as additional context control beyond semantic retrieval. By employing side information of tokens, PoCo enables precise token-level matching while preserving implicit semantic consistency modeling. Building on PoCo, we develop a multi-reference and multi-shot video generation model capable of reliably controlling characters with extremely similar visual traits. Extensive experiments demonstrate that PoCo improves cross-shot consistency and reference fidelity compared with various baselines.
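The reference-confusion problem described in the abstract can be illustrated with a toy attention computation. The sketch below is not PoCo's implementation: the abstract does not specify which position encoding is used, so a standard rotary position embedding (RoPE) stands in for it, and every name here (`rope`, `attention_weights`, the position ids 3 and 40) is a hypothetical chosen for illustration. It shows two reference tokens with identical content, which semantic attention alone cannot tell apart, and how giving the query the same position id as one reference breaks the tie.

```python
import numpy as np


def rope(x, pos, base=10000.0):
    """Apply a rotary position embedding (rotate-half layout) to one vector."""
    half = x.shape[-1] // 2
    # One rotation frequency per (x[i], x[i + half]) pair; the first is 1.0.
    freqs = base ** (-np.arange(half) / half)
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:half], x[half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos])


def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def attention_weights(query, keys, q_pos=None, k_pos=None):
    """Scaled dot-product attention weights of one query over a set of keys.

    If positions are given, query and keys are rotated first, so the score
    depends on relative position as well as on content.
    """
    if q_pos is not None:
        query = rope(query, q_pos)
        keys = np.stack([rope(k, p) for k, p in zip(keys, k_pos)])
    scores = keys @ query / np.sqrt(query.shape[-1])
    return softmax(scores)


# Two reference tokens with identical content (assumed worst case of
# "highly similar appearance"), and a query with the same content.
content = np.ones(8)
keys = np.stack([content, content])
query = content.copy()

# Semantic retrieval only: the two references are indistinguishable.
semantic_only = attention_weights(query, keys)

# Position as extra context: the query shares position id 3 with reference A,
# while reference B sits at a different (hypothetical) position id 40.
with_position = attention_weights(query, keys, q_pos=3, k_pos=[3, 40])

print("semantic only :", semantic_only)
print("with position :", with_position)
```

With identical content, the content-only weights split evenly between the two references, while the position-aware weights favour the reference whose position id matches the query. For equal content vectors the rotary score is a sum of cosines of the position offset, which peaks at offset zero, so the match is driven by position rather than by appearance.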
