CV · LG · Sep 30, 2025

CO3: Contrasting Concepts Compose Better

arXiv:2509.25940v1 · 1 citation · h-index: 53
Originality: Incremental advance
AI Analysis

The paper addresses a specific problem in text-to-image generation that matters to users who need reliable multi-concept outputs. The advance is incremental: it builds on existing guidance schemes and does not retrain models.

The paper tackled multi-concept prompt fidelity in text-to-image diffusion models, where prompts like 'a cat and a dog' can produce images with missing or distorted concepts. It proposed CO3, a corrective sampling strategy that improved concept coverage, balance, and robustness, with fewer dropped or distorted concepts than the baselines.

We propose to improve multi-concept prompt fidelity in text-to-image diffusion models. We begin with common failure cases: prompts like "a cat and a dog" that sometimes yield images where one concept is missing, faint, or colliding awkwardly with another. We hypothesize that this happens when the diffusion model drifts into mixed modes that over-emphasize a single concept it learned strongly during training. Instead of re-training, we introduce a corrective sampling strategy that steers away from regions where the joint prompt behavior overlaps too strongly with any single concept in the prompt. The goal is to steer towards "pure" joint modes where all concepts can coexist with balanced visual presence. We further show that existing multi-concept guidance schemes can operate in unstable weight regimes that amplify imbalance; we characterize favorable regions and adapt sampling to remain within them. Our approach, CO3, is plug-and-play, requires no model tuning, and complements standard classifier-free guidance. Experiments on diverse multi-concept prompts indicate improvements in concept coverage, balance, and robustness, with fewer dropped or distorted concepts compared to standard baselines and prior compositional methods. Results suggest that lightweight corrective guidance can substantially mitigate brittle semantic alignment behavior in modern diffusion systems.
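
The abstract does not state CO3's update rule, so the following is only an illustrative sketch of the general idea it describes: start from standard classifier-free guidance on the joint prompt, then push the prediction away from each single-concept direction. The function names, the uniform weight `lam`, and the subtraction form are all assumptions made for illustration, not the paper's method. Noise predictions are modeled as plain lists of floats so the example runs without any dependencies.

```python
from typing import List, Sequence

Vector = List[float]


def cfg(eps_uncond: Sequence[float], eps_joint: Sequence[float], w: float) -> Vector:
    """Standard classifier-free guidance: eps_uncond + w * (eps_joint - eps_uncond)."""
    return [u + w * (j - u) for u, j in zip(eps_uncond, eps_joint)]


def corrected_guidance(
    eps_uncond: Sequence[float],
    eps_joint: Sequence[float],
    eps_concepts: Sequence[Sequence[float]],
    w: float,
    lam: float,
) -> Vector:
    """Hypothetical corrective step (an assumption, not CO3's actual rule).

    Starts from the CFG prediction for the joint prompt and subtracts
    lam * (eps_concept - eps_uncond) for every single-concept prediction,
    steering away from directions dominated by one concept.
    """
    out = cfg(eps_uncond, eps_joint, w)
    for eps_c in eps_concepts:
        for k, (c, u) in enumerate(zip(eps_c, eps_uncond)):
            out[k] -= lam * (c - u)
    return out


# Toy 2-D example: each concept alone dominates one coordinate.
eps_uncond = [0.0, 0.0]
eps_joint = [1.0, 1.0]
eps_concepts = [[1.0, 0.0], [0.0, 1.0]]
guided = corrected_guidance(eps_uncond, eps_joint, eps_concepts, w=2.0, lam=0.5)
```

With `lam = 0` the sketch reduces to plain classifier-free guidance, which mirrors the abstract's claim that the correction complements rather than replaces it. How CO3 chooses its weights and keeps them in a stable regime is not specified in the abstract and is not modeled here.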
