CV · GR · Jun 20, 2025

BlenderFusion: 3D-Grounded Visual Editing and Generative Compositing

arXiv:2506.17450v2 · 3 citations · h-index: 13
Originality: Incremental advance
AI Analysis

This paper addresses the problem of generating coherent edited scenes for visual content creation. The contribution appears incremental, since it builds on existing diffusion models and 3D editing tools.

The paper tackles complex compositional scene editing by introducing BlenderFusion, a framework that synthesizes new scenes through a layering-editing-compositing pipeline. The authors report that it significantly outperforms prior methods on these tasks.

We present BlenderFusion, a generative visual compositing framework that synthesizes new scenes by recomposing objects, camera, and background. It follows a layering-editing-compositing pipeline: (i) segmenting and converting visual inputs into editable 3D entities (layering), (ii) editing them in Blender with 3D-grounded control (editing), and (iii) fusing them into a coherent scene using a generative compositor (compositing). Our generative compositor extends a pre-trained diffusion model to process both the original (source) and edited (target) scenes in parallel. It is fine-tuned on video frames with two key training strategies: (i) source masking, enabling flexible modifications like background replacement; (ii) simulated object jittering, facilitating disentangled control over objects and camera. BlenderFusion significantly outperforms prior methods in complex compositional scene editing tasks.
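
The two training strategies named in the abstract can be pictured as augmentations applied to a (source, target) training pair. The sketch below is only one possible reading, not the authors' implementation: the `ObjectPose` and `TrainingPair` types, the translation-only pose, the choice to jitter the source-side poses, and the `p_mask` and `max_shift` parameters are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class ObjectPose:
    # Hypothetical simplified pose: translation only, no rotation or scale.
    x: float
    y: float
    z: float


@dataclass(frozen=True)
class TrainingPair:
    # Frames are stand-in identifiers here; a real pipeline would hold images.
    source_frame: Optional[str]  # None means the source view is masked out
    source_poses: Tuple[ObjectPose, ...]
    target_frame: str
    target_poses: Tuple[ObjectPose, ...]


def mask_source(pair: TrainingPair, p_mask: float, rng: random.Random) -> TrainingPair:
    """With probability p_mask, hide the source frame (assumed form of source masking)."""
    if rng.random() < p_mask:
        return replace(pair, source_frame=None)
    return pair


def jitter_objects(pair: TrainingPair, max_shift: float, rng: random.Random) -> TrainingPair:
    """Shift each source object pose by a random offset; target poses stay untouched.

    Assumption: jittering objects independently of the camera is what lets a
    model separate object motion from camera motion.
    """
    jittered = tuple(
        ObjectPose(
            p.x + rng.uniform(-max_shift, max_shift),
            p.y + rng.uniform(-max_shift, max_shift),
            p.z + rng.uniform(-max_shift, max_shift),
        )
        for p in pair.source_poses
    )
    return replace(pair, source_poses=jittered)


def augment(pair: TrainingPair, rng: random.Random,
            p_mask: float = 0.5, max_shift: float = 0.1) -> TrainingPair:
    """Apply both hypothetical augmentations to one training pair."""
    return mask_source(jitter_objects(pair, max_shift, rng), p_mask, rng)
```

Under these assumptions, a masked source forces the model to rely on the edited target-side inputs alone, which is the kind of situation background replacement creates at inference time.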

