CC-Diff: Enhancing Contextual Coherence in Remote Sensing Image Synthesis
This work addresses the challenge of unrealistic synthesis in remote sensing applications, though it appears incremental: it builds on existing diffusion models with domain-specific modifications.
The paper tackles incoherent foreground-background relationships in remote sensing image synthesis. It introduces CC-Diff, a diffusion model that enhances contextual coherence through a Dual Re-sampler and a foreground-aware attention mechanism. The authors report improved quality metrics and detection accuracy gains of 1.83 mAP on DOTA and 2.25 mAP on COCO.
Existing image synthesis methods for natural scenes focus primarily on foreground control, often reducing the background to simplistic textures. Consequently, these approaches tend to overlook the intrinsic correlation between foreground and background, which may lead to incoherent and unrealistic synthesis results in remote sensing (RS) scenarios. In this paper, we introduce CC-Diff, a $\underline{\textbf{Diff}}$usion Model-based approach for RS image generation with enhanced $\underline{\textbf{C}}$ontext $\underline{\textbf{C}}$oherence. Specifically, we propose a novel Dual Re-sampler for feature extraction, with a built-in `Context Bridge' to explicitly capture the intricate interdependency between foreground and background. Moreover, we reinforce their connection by employing a foreground-aware attention mechanism during the generation of background features, thereby enhancing the plausibility of the synthesized context. Extensive experiments show that CC-Diff outperforms state-of-the-art methods across critical quality metrics, excelling in the RS domain and effectively generalizing to natural images. Remarkably, CC-Diff also shows high trainability, boosting detection accuracy by 1.83 mAP on DOTA and 2.25 mAP on the COCO benchmark.
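The abstract does not specify how the foreground-aware attention is implemented. As a rough illustration only, the sketch below assumes it resembles standard scaled dot-product cross-attention in which background tokens act as queries and foreground tokens act as keys and values, followed by a residual update. The function name `foreground_aware_attention`, the token layout, and the residual connection are all hypothetical choices made for this example. They are not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def foreground_aware_attention(background, foreground):
    """Update background tokens with context gathered from foreground tokens.

    Hypothetical sketch, not the CC-Diff implementation.
    background: list of query vectors (one per background token).
    foreground: list of vectors used as both keys and values.
    All vectors are assumed to share the same dimension.
    """
    dim = len(foreground[0])
    scale = 1.0 / math.sqrt(dim)
    updated = []
    for query in background:
        # Similarity between this background token and every foreground token.
        scores = [
            scale * sum(q * k for q, k in zip(query, key))
            for key in foreground
        ]
        weights = softmax(scores)
        # Weighted average of the foreground vectors.
        context = [
            sum(w * value[j] for w, value in zip(weights, foreground))
            for j in range(dim)
        ]
        # Residual update: keep the background token, add foreground context.
        updated.append([q + c for q, c in zip(query, context)])
    return updated


if __name__ == "__main__":
    bg_tokens = [[1.0, 0.0], [0.0, 1.0]]
    fg_tokens = [[1.0, 0.0], [1.0, 0.0]]
    print(foreground_aware_attention(bg_tokens, fg_tokens))
```

In this toy setting, each background token ends up shifted toward the foreground features it attends to. That is one simple way background generation could be conditioned on foreground content. A real model would add learned projections and multiple heads.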