CV | Jun 30, 2025

MTADiffusion: Mask Text Alignment Diffusion Model for Object Inpainting

arXiv:2506.23482v1 | h-index: 4 | CVPR
Originality: Incremental advance
AI Analysis

This work improves object inpainting for image editing applications, but the advance is incremental: it builds on existing diffusion models and adds a new dataset, a multi-task training strategy, and a style-consistency loss.

The paper addresses semantic misalignment, structural distortion, and style inconsistency in object inpainting. The resulting model, MTADiffusion, achieves state-of-the-art performance on the BrushBench and EditBench benchmarks.

Advancements in generative models have enabled image inpainting models to generate content within specific regions of an image based on provided prompts and masks. However, existing inpainting methods often suffer from problems such as semantic misalignment, structural distortion, and style inconsistency. In this work, we present MTADiffusion, a Mask-Text Alignment diffusion model designed for object inpainting. To enhance the semantic capabilities of the inpainting model, we introduce MTAPipeline, an automatic solution for annotating masks with detailed descriptions. Based on the MTAPipeline, we construct a new MTADataset comprising 5 million images and 25 million mask-text pairs. Furthermore, we propose a multi-task training strategy that integrates both inpainting and edge prediction tasks to improve structural stability. To promote style consistency, we present a novel inpainting style-consistency loss using a pre-trained VGG network and the Gram matrix. Comprehensive evaluations on BrushBench and EditBench demonstrate that MTADiffusion achieves state-of-the-art performance compared to other methods.
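
The abstract mentions a style-consistency loss built from a pre-trained VGG network and the Gram matrix, combined with inpainting and edge-prediction objectives. The sketch below illustrates only the generic Gram-matrix style loss and a weighted sum of three loss terms. It is not the paper's formulation: the function names, the normalisation, and the weights `w_edge` and `w_style` are assumptions for illustration, and plain NumPy arrays stand in for VGG feature maps.

```python
import numpy as np


def gram_matrix(features):
    """Return the (C, C) Gram matrix of a (C, H, W) feature map.

    The division by C * H * W is one common normalisation choice,
    assumed here; the paper's exact scaling is not given in the abstract.
    """
    c, h, w = features.shape
    flat = features.reshape(c, h * w)
    return flat @ flat.T / (c * h * w)


def style_loss(feats_generated, feats_reference):
    """Sum over layers of the mean squared Gram-matrix difference.

    Each argument is a list of (C, H, W) arrays, one per layer. In practice
    these would be activations from a pre-trained VGG network; here they
    are ordinary arrays supplied by the caller.
    """
    return sum(
        float(np.mean((gram_matrix(g) - gram_matrix(r)) ** 2))
        for g, r in zip(feats_generated, feats_reference)
    )


def total_loss(l_inpaint, l_edge, l_style, w_edge=1.0, w_style=1.0):
    """Weighted sum of the three objectives (the weights are hypothetical)."""
    return l_inpaint + w_edge * l_edge + w_style * l_style


# Random arrays stand in for two layers of VGG activations.
rng = np.random.default_rng(0)
generated = [rng.normal(size=(8, 16, 16)), rng.normal(size=(16, 8, 8))]
reference = [rng.normal(size=(8, 16, 16)), rng.normal(size=(16, 8, 8))]

loss_same = style_loss(generated, generated)  # identical features give 0.0
loss_diff = style_loss(generated, reference)  # differing features give a positive value
```

Because the Gram matrix discards spatial layout and keeps only channel co-activation statistics, matching it encourages the filled region to share texture and colour statistics with the reference without forcing the same structure.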

