JEDI: The Force of Jensen-Shannon Divergence in Disentangling Diffusion Models
The work addresses poor compositional control in diffusion-based image generation, though the contribution is incremental because it builds on existing adaptation techniques.
The paper tackles the problem of semantic entanglement in diffusion models by introducing JEDI, a test-time adaptation method that improves subject separation and compositional alignment without retraining, achieving consistent enhancements in prompt alignment for models like Stable Diffusion 1.5 and 3.5.
We introduce JEDI, a test-time adaptation method that enhances subject separation and compositional alignment in diffusion models without requiring retraining or external supervision. JEDI operates by minimizing semantic entanglement in attention maps using a novel objective based on the Jensen-Shannon divergence. To improve efficiency, we leverage adversarial optimization, reducing the number of update steps required. JEDI is model-agnostic and applicable to architectures such as Stable Diffusion 1.5 and 3.5, consistently improving prompt alignment and disentanglement in complex scenes. Additionally, JEDI provides a lightweight, CLIP-free disentanglement score derived from internal attention distributions, offering a principled benchmark for compositional alignment under test-time conditions. Code and results are available at https://ericbill21.github.io/JEDI/.
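To make the central quantity concrete, the following is a minimal sketch of the standard Jensen-Shannon divergence between two attention maps, each flattened and normalized into a probability distribution. It is not the JEDI objective itself, whose exact form is not given in this abstract; the function names (`normalize`, `kl_divergence`, `js_divergence`) and the toy attention maps are hypothetical and chosen only for illustration. With base-2 logarithms the value lies in [0, 1]: 0 when two subjects attend to the same regions (fully entangled) and 1 when their attention does not overlap at all.

```python
import math


def normalize(weights):
    """Scale non-negative attention weights so they sum to 1."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("attention map must have positive total mass")
    return [w / total for w in weights]


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits.

    Terms with p_i == 0 contribute nothing, by the usual convention.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two flattened attention maps.

    Both inputs are normalized first. The mixture m is positive wherever
    p or q is positive, so the KL terms never divide by zero.
    """
    p, q = normalize(p), normalize(q)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Toy 2x2 attention maps, flattened (hypothetical values).
cat_map = [1.0, 0.0, 0.0, 0.0]   # "cat" token attends to the top-left patch
dog_map = [0.0, 0.0, 1.0, 0.0]   # "dog" token attends to the bottom-left patch

separated = js_divergence(cat_map, dog_map)   # disjoint attention
entangled = js_divergence(cat_map, cat_map)   # identical attention
```

Under this reading, a method that reduces entanglement would push the divergence between different subjects' attention distributions upward. How the paper combines such terms across tokens, layers, and denoising steps is not specified here, so any such aggregation would be a further assumption.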