CV | AI | Nov 15, 2025

Image-POSER: Reflective RL for Multi-Expert Image Generation and Editing

arXiv:2511.11780v1 | h-index: 3
Originality: Incremental advance
AI Analysis

This work targets creative workflows, where existing text-to-image models fail on complex prompts. It is an incremental advance, achieved by orchestrating existing models in a novel way.

The paper tackles the problem of generating and editing images from long, compositional prompts by introducing Image-POSER, a reflective reinforcement learning framework that orchestrates multiple pretrained experts. In the reported experiments and human evaluations, it outperforms baselines in alignment, fidelity, and aesthetics.

Recent advances in text-to-image generation have produced strong single-shot models, yet no individual system reliably executes the long, compositional prompts typical of creative workflows. We introduce Image-POSER, a reflective reinforcement learning framework that (i) orchestrates a diverse registry of pretrained text-to-image and image-to-image experts, (ii) handles long-form prompts end-to-end through dynamic task decomposition, and (iii) supervises alignment at each step via structured feedback from a vision-language model critic. By casting image synthesis and editing as a Markov Decision Process, we learn non-trivial expert pipelines that adaptively combine strengths across models. Experiments show that Image-POSER outperforms baselines, including frontier models, across industry-standard and custom benchmarks in alignment, fidelity, and aesthetics, and is consistently preferred in human evaluations. These results highlight that reinforcement learning can endow AI systems with the capacity to autonomously decompose, reorder, and combine visual models, moving towards general-purpose visual assistants.
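To make the Markov Decision Process framing in the abstract concrete, here is a minimal toy sketch, not the paper's implementation. Everything in it is an assumption made for illustration: the expert names, the sub-task labels, the reward values, and the use of tabular Q-learning. The state is reduced to the next unfinished sub-task, the action is the choice of expert, and a rule-based `critic` stands in for the vision-language model critic. The real system presumably conditions on the prompt and the current image as well.

```python
import random

# Hypothetical expert registry: expert name -> sub-task kinds it handles well.
EXPERTS = {
    "t2i_generalist": {"generate"},
    "i2i_text_renderer": {"add_text"},
    "i2i_style_editor": {"restyle"},
}


def critic(task, expert):
    """Toy stand-in for a VLM critic: returns (reward, sub-task satisfied?)."""
    ok = task in EXPERTS[expert]
    return (1.0 if ok else -0.2), ok


def best_expert(q, task):
    """Greedy action: the expert with the highest Q-value for this sub-task."""
    return max(EXPERTS, key=lambda e: q.get((task, e), 0.0))


def train(tasks, episodes=200, eps=0.2, alpha=0.5, gamma=0.9,
          max_steps=20, seed=0):
    """Tabular Q-learning over (sub-task, expert) pairs."""
    rng = random.Random(seed)
    q = {}
    for _ in range(episodes):
        remaining = list(tasks)  # decomposed prompt, in execution order
        for _ in range(max_steps):
            if not remaining:
                break
            task = remaining[0]
            # Epsilon-greedy choice of expert.
            if rng.random() < eps:
                expert = rng.choice(list(EXPERTS))
            else:
                expert = best_expert(q, task)
            reward, ok = critic(task, expert)
            if ok:
                remaining.pop(0)  # critic accepted the step; move on
            # Value of the next state (0 once every sub-task is done).
            next_value = (
                max(q.get((remaining[0], e), 0.0) for e in EXPERTS)
                if remaining else 0.0
            )
            old = q.get((task, expert), 0.0)
            q[(task, expert)] = old + alpha * (reward + gamma * next_value - old)
    return q


def greedy_pipeline(tasks, q):
    """Read off the learned expert pipeline for a list of sub-tasks."""
    return [best_expert(q, t) for t in tasks]


TASKS = ["generate", "add_text", "restyle"]
q_table = train(TASKS)
print(greedy_pipeline(TASKS, q_table))
```

In this toy setting a rejected step leaves the agent on the same sub-task with a small penalty, so the learned policy routes each sub-task to the expert the critic accepts. The paper's learned pipelines are described as non-trivial, which this sketch does not attempt to reproduce.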

