CV | Dec 18, 2025

Geometric Disentanglement of Text Embeddings for Subject-Consistent Text-to-Image Generation using A Single Prompt

arXiv:2512.16443v3 | 1 citation | h-index: 8
Originality: Incremental advance
AI Analysis

This work addresses subject inconsistency, a key limitation of text-to-image generation for applications such as visual storytelling, and offers a training-free alternative that avoids the cost of fine-tuning or image conditioning.

The paper tackles subject inconsistency in text-to-image diffusion models with a training-free method that refines text embeddings to suppress unwanted semantics. The authors report significant improvements in both subject consistency and text alignment over existing baselines.

Text-to-image diffusion models excel at generating high-quality images from natural language descriptions but often fail to preserve subject consistency across multiple outputs, limiting their use in visual storytelling. Existing approaches rely on model fine-tuning or image conditioning, which are computationally expensive and require per-subject optimization. 1Prompt1Story, a training-free approach, concatenates all scene descriptions into a single prompt and rescales token embeddings, but it suffers from semantic leakage: embeddings across frames become entangled, causing text misalignment. In this paper, we propose a simple yet effective training-free approach that addresses semantic entanglement from a geometric perspective by refining text embeddings to suppress unwanted semantics. Extensive experiments demonstrate that our approach significantly improves both subject consistency and text alignment over existing baselines.
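The abstract does not spell out the geometric operation, so the following is only an illustrative sketch of one way "suppressing unwanted semantics" in an embedding could look: removing the component of a target frame's embedding that lies in the subspace spanned by the other frames' embeddings. The function name `suppress_unwanted`, the `strength` parameter, and the use of an orthogonal projection are all assumptions made for illustration and are not taken from the paper.

```python
import numpy as np


def suppress_unwanted(target, unwanted, strength=1.0):
    """Hypothetical helper: subtract from `target` (shape (d,)) its component
    inside the subspace spanned by the rows of `unwanted` (shape (k, d)).
    This is an assumed stand-in, not the paper's actual refinement step."""
    # SVD gives an orthonormal basis for the span of the unwanted embeddings.
    u, s, _ = np.linalg.svd(unwanted.T, full_matrices=False)
    # Drop directions with (numerically) zero singular value so that
    # duplicated or collinear unwanted embeddings are handled correctly.
    basis = u[:, s > s.max() * 1e-10]
    projection = basis @ (basis.T @ target)
    # strength=1.0 removes the component fully; strength=0.0 leaves target as is.
    return target - strength * projection


# Toy data: three random "frame" embeddings of dimension 8 (not real text embeddings).
rng = np.random.default_rng(0)
frames = rng.normal(size=(3, 8))
refined = suppress_unwanted(frames[0], frames[1:])
# After full suppression, the refined vector is orthogonal to the other frames.
leakage = np.abs(frames[1:] @ refined).max()
```

In this toy setup `leakage` is zero up to floating-point error, which is the sense in which the other frames' semantics have been "suppressed". A real pipeline would operate on per-token embeddings from the text encoder and would likely need a softer `strength` to avoid removing shared subject information.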

