CV, Mar 26, 2025

Latent Beam Diffusion Models for Generating Visual Sequences

arXiv:2503.20429v3, 3 citations, h-index: 37
Originality: Incremental advance
AI Analysis

This paper addresses the challenge of visual consistency in image sequence generation for applications like storytelling, though it is an incremental improvement over existing diffusion methods.

The paper tackled the problem of generating coherent visual sequences with diffusion models, which often produce disjointed narratives when generating images independently. The result was a beam search strategy that improved sequence coherence, visual continuity, and textual alignment, as confirmed by human and automatic evaluations.

While diffusion models excel at generating high-quality images from text prompts, they struggle with visual consistency when generating image sequences. Existing methods generate each image independently, leading to disjointed narratives - a challenge further exacerbated in non-linear storytelling, where scenes must connect beyond adjacent images. We introduce a novel beam search strategy for latent space exploration, enabling conditional generation of full image sequences with beam search decoding. In contrast to earlier methods that rely on fixed latent priors, our method dynamically samples past latents to search for an optimal sequence of latent representations, ensuring coherent visual transitions. As the latent denoising space is explored, the beam search graph is pruned with a cross-attention mechanism that efficiently scores search paths, prioritizing alignment with both textual prompts and visual context. Human and automatic evaluations confirm that BeamDiffusion outperforms other baseline methods, producing full sequences with superior coherence, visual continuity, and textual alignment.
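The search loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: latents are plain floats rather than tensors, and `propose` and `score` are hypothetical stand-ins for the diffusion denoiser and the cross-attention scorer, respectively. The sketch only shows the control flow: at each sequence step, every surviving path is extended with candidates conditioned on any past latent, each extension is scored, and the graph is pruned to the top `beam_width` paths.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Latent = float  # toy stand-in; a real latent would be a tensor


@dataclass(frozen=True)
class Beam:
    latents: Tuple[Latent, ...]  # one chosen latent per sequence step so far
    score: float                 # cumulative score of this search path


def beam_search_sequence(
    prompts: Sequence[float],
    propose: Callable[[float, Tuple[Latent, ...]], List[Latent]],
    score: Callable[[float, Tuple[Latent, ...], Latent], float],
    beam_width: int,
) -> Beam:
    """Return the highest-scoring path of latents, one latent per prompt."""
    if beam_width < 1:
        raise ValueError("beam_width must be at least 1")
    beams = [Beam(latents=(), score=0.0)]
    for prompt in prompts:
        # Extend every surviving path with every candidate latent.
        candidates = [
            Beam(beam.latents + (latent,),
                 beam.score + score(prompt, beam.latents, latent))
            for beam in beams
            for latent in propose(prompt, beam.latents)
        ]
        # Prune: keep only the best `beam_width` paths.
        candidates.sort(key=lambda b: b.score, reverse=True)
        beams = candidates[:beam_width]
    return beams[0]


def toy_propose(prompt: float, history: Tuple[Latent, ...]) -> List[Latent]:
    # Hypothetical generator: one candidate from the prompt alone, plus one
    # blended with each past latent (not only the adjacent one).
    return [prompt] + [(h + prompt) / 2 for h in history]


def toy_score(prompt: float, history: Tuple[Latent, ...], latent: Latent) -> float:
    # Hypothetical scorer: reward closeness to the prompt (textual alignment)
    # and to the earlier latents (visual context).
    text_term = -abs(latent - prompt)
    if not history:
        return text_term
    context_term = -sum(abs(latent - h) for h in history) / len(history)
    return text_term + context_term


best = beam_search_sequence([0.0, 1.0, 2.0], toy_propose, toy_score, beam_width=2)
print(best.latents, best.score)
```

Setting `beam_width=1` reduces the loop to greedy step-by-step selection. A wider beam keeps alternative paths alive so that an early choice that scores slightly lower can still win if it leads to a more coherent sequence overall.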

