CV · LG · May 12

EPIC: Efficient Predicate-Guided Inference-Time Control for Compositional Text-to-Image Generation

arXiv:2605.11722 · 78.01 citations
Predicted impact: top 28% in CV · last 90 days
Originality: Incremental advance
AI Analysis

This work addresses the persistent challenge of compositional faithfulness in text-to-image generation, offering practitioners a practical and efficient inference-time solution that requires no model retraining.

EPIC is a training-free inference-time refinement framework for compositional text-to-image generation that uses predicate-guided search to verify and iteratively correct generated images. On GenEval2, it raises prompt-level accuracy from 34.16% to 71.46% and outperforms the strongest prior refinement baseline by 19.23 points, while cutting realized cost by 31% in image-model executions, 72% in MLLM calls, and 81% in MLLM tokens per prompt.

Recent text-to-image (T2I) generators can synthesize realistic images, but still struggle with compositional prompts involving multiple objects, counts, attributes, and relations. We introduce EPIC (Efficient Predicate-Guided Inference-Time Control), a training-free inference-time refinement framework for compositional T2I generation. EPIC casts refinement as predicate-guided search: it parses the original prompt once into a fixed visual program of object variables and typed predicates, covering checkable conditions such as object presence, counts, attributes, and relations. Each generated or edited image is verified against this program using visual evidence extracted from that image. An image is judged to satisfy the prompt only when all predicates are satisfied; otherwise, failed predicates decide the next step, routing local failures to targeted editing and global failures to resampling while the fixed visual program remains unchanged. On GenEval2, EPIC improves prompt-level accuracy from 34.16% for single-pass generation with the base generator to 71.46%. Under the same generator/editor setting and maximum image-model execution budget, EPIC outperforms the strongest prior refinement baseline by 19.23 points while reducing realized cost by 31% in image-model executions, 72% in MLLM calls, and 81% in MLLM tokens per prompt.
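
The verify-then-route loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Predicate` class, the evidence layout (a dict from object label to a list of detections), the `left_of` relation test, the choice of which predicate kinds count as global failures (`GLOBAL_KINDS`), and the `generate`, `edit`, and `extract` callables are all hypothetical stand-ins that the abstract does not specify.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical evidence layout: object label -> list of detections,
# each detection being {"attrs": set of str, "x": horizontal centre}.
Evidence = Dict[str, List[dict]]


@dataclass(frozen=True)
class Predicate:
    kind: str             # "presence", "count", "attribute", or "relation"
    subject: str          # object variable, e.g. "dog"
    value: object = None  # int count, attribute name, or (relation, other)


# Assumption: missing objects or wrong counts are treated as global failures.
GLOBAL_KINDS = {"presence", "count"}


def check(pred: Predicate, ev: Evidence) -> bool:
    """Evaluate one typed predicate against evidence from a single image."""
    found = ev.get(pred.subject, [])
    if pred.kind == "presence":
        return len(found) > 0
    if pred.kind == "count":
        return len(found) == pred.value
    if pred.kind == "attribute":
        return bool(found) and all(pred.value in d["attrs"] for d in found)
    if pred.kind == "relation":
        rel, other = pred.value
        others = ev.get(other, [])
        if not found or not others:
            return False
        if rel == "left_of":
            return max(d["x"] for d in found) < min(d["x"] for d in others)
        raise ValueError(f"unsupported relation: {rel}")
    raise ValueError(f"unsupported predicate kind: {pred.kind}")


def failed_predicates(program: List[Predicate], ev: Evidence) -> List[Predicate]:
    """An image satisfies the prompt only when this list is empty."""
    return [p for p in program if not check(p, ev)]


def route(failed: List[Predicate]) -> str:
    """Failed predicates decide the next step: accept, resample, or edit."""
    if not failed:
        return "accept"
    if any(p.kind in GLOBAL_KINDS for p in failed):
        return "resample"
    return "edit"


def refine(
    program: List[Predicate],
    generate: Callable[[], object],
    edit: Callable[[object, List[Predicate]], object],
    extract: Callable[[object], Evidence],
    budget: int,
) -> Tuple[object, List[Predicate], int]:
    """Search under a maximum number of image-model executions.

    The program is built once and never modified inside the loop.
    """
    image, used = generate(), 1
    while True:
        failed = failed_predicates(program, extract(image))
        action = route(failed)
        if action == "accept" or used >= budget:
            return image, failed, used
        image = generate() if action == "resample" else edit(image, failed)
        used += 1
```

In this sketch, `refine` returns the final image, any predicates still failing when the budget ran out, and the number of image-model executions actually used, which corresponds loosely to the "realized cost" the abstract reports. A real system would back `extract` with a detector or MLLM and `generate`/`edit` with the image models.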
