CL · Apr 1, 2025

Follow the Flow: On Information Flow Across Textual Tokens in Text-to-Image Models

Guy Kaplan, Michael Toker, Yuval Reif, Yonatan Belinkov, Roy Schwartz

arXiv:2504.01137v2 · 4 citations · h-index: 31

Originality: Incremental advance

AI Analysis

This work addresses misalignment issues in text-to-image generation by analyzing token-level encoding. It is rated an incremental advance because it builds on prior diffusion-focused research.

The paper investigates how semantic information is distributed across token representations in text-to-image models. It finds that information is often concentrated in one or two tokens of a lexical item and that lexical items usually remain isolated from each other. When items do influence each other, misinterpretations can result, such as 'pool' representing a pool table in the prompt 'a pool by a table'.

Text-to-image (T2I) models generate images by encoding text prompts into token representations, which then guide the diffusion process. While prior work has largely focused on improving alignment by refining the diffusion process, we focus on the textual encoding stage. Specifically, we investigate how semantic information is distributed across token representations within and between lexical items (i.e., words or expressions conveying a single concept) in the prompt. We analyze information flow at two levels: (1) in-item representation, i.e., whether individual tokens represent their lexical item, and (2) cross-item interaction, i.e., whether information flows across the tokens of different lexical items. We use patching techniques to uncover surprising encoding patterns. We find that information is usually concentrated in only one or two of the item's tokens. For example, in the item "San Francisco's Golden Gate Bridge", the token "Gate" sufficiently captures the entire expression, while the other tokens could effectively be discarded. Lexical items also tend to remain isolated; for instance, the token "dog" encodes no visual information about "green" in the prompt "a green dog". However, in some cases, items do influence each other's representation, often leading to misinterpretations: e.g., in the prompt "a pool by a table", the token "pool" represents a pool table after contextualization. Our findings highlight the critical role of token-level encoding in image generation, suggesting that misalignment issues may already originate during textual encoding.
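
The abstract does not spell out the patching procedure, so the following is only a minimal sketch of the general idea, under stated assumptions: per-token encoder outputs are treated as plain lists of floats, a "neutral" prompt of the same length stands in for the context-free baseline, and the helper name `patch_tokens` and all numeric values are hypothetical. It is not the authors' implementation.

```python
from typing import Dict, List, Sequence

Vector = List[float]


def patch_tokens(
    base: Sequence[Vector],
    source: Sequence[Vector],
    mapping: Dict[int, int],
) -> List[Vector]:
    """Return a copy of `base` in which base[dst] is replaced by source[src]
    for every (dst, src) pair in `mapping`. Inputs are not modified."""
    patched = [list(vec) for vec in base]
    for dst, src in mapping.items():
        if not 0 <= dst < len(base):
            raise IndexError(f"destination position {dst} out of range")
        if not 0 <= src < len(source):
            raise IndexError(f"source position {src} out of range")
        if len(source[src]) != len(base[dst]):
            raise ValueError("embedding dimensions differ")
        patched[dst] = list(source[src])
    return patched


# Toy stand-ins for text-encoder outputs: one 3-d vector per token.
# The values are made up; real encoder states are far larger.
prompt_tokens = ["a", "green", "dog"]
prompt_states = [[0.1, 0.0, 0.0], [0.0, 0.9, 0.2], [0.3, 0.1, 0.8]]

# Hypothetical neutral baseline of the same length (e.g. padding-only prompt).
neutral_states = [[0.0, 0.0, 0.0] for _ in prompt_tokens]

# Keep only the contextualized state of "dog" (position 2); neutralize the rest.
dog_only = patch_tokens(neutral_states, prompt_states, {2: 2})
```

In a real experiment, a sequence like `dog_only` would be passed to the image generator in place of the full prompt encoding. If the resulting dog is not green, that would suggest the "dog" token carries no color information, which is the kind of cross-item question the abstract describes.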
