DeCorStory: Gram-Schmidt Prompt Embedding Decorrelation for Consistent Storytelling
DeCorStory addresses inconsistent storytelling in text-to-image generation for users who need coherent multi-frame outputs, though the contribution is incremental: it builds on existing diffusion pipelines without modifying the model.
The paper tackles the challenge of maintaining visual and semantic consistency in text-to-image storytelling by proposing DeCorStory, a training-free inference-time framework that reduces inter-frame semantic interference. It reports state-of-the-art performance among training-free baselines, with improvements in prompt-image alignment, identity consistency, and visual diversity.
Maintaining visual and semantic consistency across frames is a key challenge in text-to-image storytelling. Existing training-free methods, such as One-Prompt-One-Story, concatenate all prompts into a single sequence, which often induces strong embedding correlation and leads to color leakage, background blending, and identity drift. We propose DeCorStory, a training-free inference-time framework that explicitly reduces inter-frame semantic interference. DeCorStory applies Gram-Schmidt prompt embedding decorrelation to orthogonalize frame-level semantics, followed by singular value reweighting to strengthen prompt-specific information and identity-preserving cross-attention to stabilize character identity during diffusion. The method requires no model modification or fine-tuning and can be seamlessly integrated into existing diffusion pipelines. Experiments demonstrate consistent improvements in prompt-image alignment, identity consistency, and visual diversity, achieving state-of-the-art performance among training-free baselines. Code is available at: https://github.com/YuZhenyuLindy/DeCorStory
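To make the decorrelation step concrete, the following is a minimal sketch of Gram-Schmidt orthogonalization applied to frame-level prompt embeddings. It is not the authors' implementation. It assumes each frame's prompt is summarized by a single vector (for example, flattened or pooled text-encoder output), and the function name `gram_schmidt_decorrelate` and the synthetic embeddings are hypothetical. The singular value reweighting and identity-preserving cross-attention steps are not shown, since the abstract does not specify them.

```python
import numpy as np


def gram_schmidt_decorrelate(frame_embeddings, eps=1e-8):
    """Orthogonalize frame-level embedding vectors in order (hypothetical helper).

    frame_embeddings: array of shape (num_frames, dim), one vector per frame.
    Returns an array of the same shape whose rows are mutually orthogonal.
    Rows are not normalized, so the first frame is returned unchanged.
    """
    E = np.asarray(frame_embeddings, dtype=np.float64)
    out = np.zeros_like(E)
    for i in range(E.shape[0]):
        v = E[i].copy()
        for j in range(i):
            denom = out[j] @ out[j]
            if denom > eps:
                # Remove the component of v that lies along an earlier frame.
                v -= (v @ out[j]) / denom * out[j]
        out[i] = v
    return out


# Synthetic stand-in for correlated prompt embeddings (assumption):
# every frame shares a common component plus a small frame-specific part.
rng = np.random.default_rng(0)
shared = rng.normal(size=64)
frames = np.stack([shared + 0.3 * rng.normal(size=64) for _ in range(3)])

decorrelated = gram_schmidt_decorrelate(frames)
gram = decorrelated @ decorrelated.T  # off-diagonal entries measure overlap
```

After this step the off-diagonal entries of `gram` are numerically zero, meaning no frame vector retains a component along an earlier frame. Leaving the rows unnormalized is a design choice in this sketch so that the first frame's embedding stays intact; a real pipeline may rescale the results before passing them to the diffusion model.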