CV · AI · Mar 8

How Long Can Unified Multimodal Models Generate Images Reliably? Taming Long-Horizon Interleaved Image Generation via Context Curation

arXiv:2603.07540v1
Predicted impact: top 8% in CV · last 90 days
Originality: Highly original
AI Analysis

This work addresses a critical reliability gap in long-horizon interleaved image generation for unified multimodal models, preventing quality collapse as sequences grow.

Unified multimodal models struggle with generating long, interleaved text and image sequences due to a "visual pollution" problem where accumulated visual history degrades generation quality. The proposed UniLongGen strategy addresses this by dynamically curating the model's memory, discarding interfering visual signals, which significantly improves long-horizon fidelity and consistency while reducing memory and inference time.

Unified multimodal models hold the promise of generating extensive, interleaved narratives, weaving text and imagery into coherent long-form stories. However, current systems suffer from a critical reliability gap: as sequences grow, generation quality rapidly collapses. In this work, we investigate the mechanism behind this failure and argue that it is distinct from standard long-context challenges. We reveal that in generation, accumulated visual history acts as a source of active pollution, a decay governed specifically by the number of image events rather than raw token count. We identify a structural vulnerability where dense visual tokens overwhelm the attention mechanism, creating noise that distorts future synthesis. Guided by these mechanistic insights, we propose UniLongGen, a training-free inference strategy that prioritizes safe conditioning over total recall. Instead of retaining all history, UniLongGen dynamically curates the model's memory, identifying and discarding interfering visual signals based on the model's own internal relevance rankings. Extensive experiments demonstrate that this active forgetting approach is essential for stability: UniLongGen significantly outperforms baselines in long-horizon fidelity and consistency, while simultaneously reducing memory footprint and inference time.
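The curation idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Segment` type, the `relevance` field, and the `max_images` budget are hypothetical names introduced here, and the sketch simply assumes that a per-image relevance score (for example, attention mass the model assigns to each past image) is already available. It keeps all text history and retains only the highest-ranked image segments, in their original order.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    """One block of interleaved history (hypothetical structure)."""
    kind: str                # "text" or "image"
    tokens: List[int]        # placeholder token ids
    relevance: float = 0.0   # assumed score, e.g. attention mass from the model


def curate_context(history: List[Segment], max_images: int) -> List[Segment]:
    """Keep every text segment and only the `max_images` most relevant
    image segments, preserving the original ordering of the history."""
    if max_images < 0:
        raise ValueError("max_images must be non-negative")

    # Indices of all image segments in the history.
    image_idx = [i for i, seg in enumerate(history) if seg.kind == "image"]

    # Rank images by the assumed relevance score, highest first.
    ranked = sorted(image_idx, key=lambda i: history[i].relevance, reverse=True)
    keep = set(ranked[:max_images])

    # Drop low-ranked images; text is always retained.
    return [seg for i, seg in enumerate(history)
            if seg.kind == "text" or i in keep]


history = [
    Segment("text", [1, 2]),
    Segment("image", [10] * 4, relevance=0.1),
    Segment("text", [3]),
    Segment("image", [11] * 4, relevance=0.7),
    Segment("image", [12] * 4, relevance=0.4),
]
curated = curate_context(history, max_images=2)
```

In this toy history the least relevant image is dropped, so the curated context is bounded by the number of retained image events rather than by raw token count, which mirrors the abstract's claim about what drives the degradation. How the relevance scores are actually computed, and how many images are kept, are left open here.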

