CV · Dec 17, 2025

Qwen-Image-Layered: Towards Inherent Editability via Layer Decomposition

arXiv:2512.15603v1 · 16 citations · h-index: 32 · Has Code
Originality: Highly original
AI Analysis

This paper addresses a common problem for users of generative models: keeping image edits consistent. Its approach is inspired by the layered representations used in professional design tools.

The paper tackles inconsistent image editing in visual generative models with Qwen-Image-Layered, an end-to-end diffusion model that decomposes an RGB image into multiple semantically disentangled RGBA layers. Each layer can be edited independently, which makes the representation inherently editable. The authors report decomposition quality that significantly surpasses existing approaches.

Recent visual generative models often struggle with consistency during image editing due to the entangled nature of raster images, where all visual content is fused into a single canvas. In contrast, professional design tools employ layered representations, allowing isolated edits while preserving consistency. Motivated by this, we propose Qwen-Image-Layered, an end-to-end diffusion model that decomposes a single RGB image into multiple semantically disentangled RGBA layers, enabling inherent editability, where each RGBA layer can be independently manipulated without affecting other content. To support variable-length decomposition, we introduce three key components: (1) an RGBA-VAE to unify the latent representations of RGB and RGBA images; (2) a VLD-MMDiT (Variable Layers Decomposition MMDiT) architecture capable of decomposing a variable number of image layers; and (3) a Multi-stage Training strategy to adapt a pretrained image generation model into a multilayer image decomposer. Furthermore, to address the scarcity of high-quality multilayer training images, we build a pipeline to extract and annotate multilayer images from Photoshop documents (PSD). Experiments demonstrate that our method significantly surpasses existing approaches in decomposition quality and establishes a new paradigm for consistent image editing. Our code and models are released at https://github.com/QwenLM/Qwen-Image-Layered.
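"Inherent editability" works because a stack of RGBA layers can be flattened back into the original RGB image by alpha compositing. An edit to one layer therefore changes only the pixels that layer actually covers. The sketch below shows this with the standard Porter-Duff "over" operator on straight (non-premultiplied) alpha, in pure Python on tiny images. It is an illustration only, not the paper's model or code. It makes three assumptions: layers are ordered bottom to top, the bottom layer is opaque, and all function names (`over`, `composite`, `recolor`, `rgb_as_rgba`) are hypothetical. `rgb_as_rgba` shows one simple way an RGB image can be treated as an RGBA image with alpha = 1. It is not a description of how the paper's RGBA-VAE actually unifies the two.

```python
from typing import List, Tuple

# A pixel is (r, g, b, a) with every channel a float in [0, 1] (straight alpha).
RGBA = Tuple[float, float, float, float]
RGB = Tuple[float, float, float]
Layer = List[List[RGBA]]  # H x W grid of RGBA pixels


def over(top: RGBA, bottom: RGBA) -> RGBA:
    """Porter-Duff 'over': place `top` on `bottom` (straight alpha)."""
    ta, ba = top[3], bottom[3]
    out_a = ta + ba * (1.0 - ta)
    if out_a == 0.0:
        return (0.0, 0.0, 0.0, 0.0)  # both pixels fully transparent
    r, g, b = (
        (t * ta + s * ba * (1.0 - ta)) / out_a
        for t, s in zip(top[:3], bottom[:3])
    )
    return (r, g, b, out_a)


def composite(layers: List[Layer]) -> List[List[RGB]]:
    """Flatten layers (index 0 = bottom) into an RGB image.

    Assumes the bottom layer is opaque, so the result's alpha is 1
    and dropping it is lossless.
    """
    h, w = len(layers[0]), len(layers[0][0])
    image: List[List[RGB]] = []
    for y in range(h):
        row: List[RGB] = []
        for x in range(w):
            acc: RGBA = (0.0, 0.0, 0.0, 0.0)
            for layer in layers:  # bottom to top
                acc = over(layer[y][x], acc)
            row.append(acc[:3])
        image.append(row)
    return image


def recolor(layer: Layer, rgb: RGB) -> Layer:
    """Hypothetical edit: change a layer's colour but keep its alpha mask."""
    return [[(rgb[0], rgb[1], rgb[2], p[3]) for p in row] for row in layer]


def rgb_as_rgba(image: List[List[RGB]]) -> Layer:
    """Treat an RGB image as a fully opaque RGBA layer (alpha = 1)."""
    return [[(r, g, b, 1.0) for (r, g, b) in row] for row in image]


if __name__ == "__main__":
    white = (1.0, 1.0, 1.0, 1.0)
    background: Layer = [[white, white]]                           # opaque 1x2 canvas
    foreground: Layer = [[(1.0, 0.0, 0.0, 1.0), (0.0, 0.0, 0.0, 0.0)]]  # red pixel + hole

    print(composite([background, foreground]))
    # Editing only the foreground layer leaves background-only pixels untouched.
    print(composite([background, recolor(foreground, (0.0, 0.0, 1.0))]))
```

The edit touches only pixel (0, 0), which the foreground covers. Pixel (0, 1) stays white because the foreground is transparent there. The layered representation aims to provide this isolation, in contrast with regenerating a single fused raster.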
