RSEdit: Text-Guided Image Editing for Remote Sensing
RSEdit addresses the need for reliable image editing in remote sensing applications, such as environmental monitoring and urban planning, by adapting existing methods to domain-specific constraints. The contribution is an incremental improvement over those methods.
The paper tackles text-guided image editing for remote sensing imagery, where general-domain editors introduce artifacts and break orthographic constraints. It presents RSEdit, a framework that adapts pretrained diffusion models to produce precise, physically coherent edits, and reports clear gains over baselines across diverse scenarios such as disaster impacts and urban growth.
General-domain text-guided image editors achieve strong photorealism but introduce artifacts, hallucinate objects, and break the orthographic constraints of remote sensing (RS) imagery. We trace this gap to two high-level causes: (i) limited RS world knowledge in pretrained models, and (ii) conditioning schemes that misalign with the bi-temporal structure and spatial priors of Earth observation data. We present RSEdit, a unified framework that adapts pretrained text-to-image diffusion models (both U-Net and DiT) into instruction-following RS editors via channel concatenation and in-context token concatenation. Trained on over 60,000 semantically rich bi-temporal remote sensing image pairs, RSEdit learns precise, physically coherent edits while preserving geospatial content. Experiments show clear gains over general and commercial baselines, demonstrating strong generalizability across diverse scenarios including disaster impacts, urban growth, and seasonal shifts, positioning RSEdit as a robust data engine for downstream analysis. We will release code, pretrained models, evaluation protocols, training logs, and generated results for full reproducibility. Code: https://github.com/Bili-Sakura/RSEdit-Preview
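The two conditioning schemes named in the abstract can be illustrated at the shape level. The following is a minimal sketch, not the authors' implementation: it assumes channel concatenation means stacking the source-image latent onto the noisy latent along the channel axis (as a U-Net input), and that in-context token concatenation means appending source-image tokens to the noisy tokens along the sequence axis (as a DiT input). The latent size (4 x 8 x 8), the patch size of 2, and every function name are hypothetical choices for illustration; plain nested lists stand in for tensors so the example runs without a tensor library.

```python
# Shape-level sketch of two conditioning schemes (illustrative assumptions only).
# Nested lists of shape [C][H][W] stand in for latent tensors.

def make_latent(channels, height, width, fill):
    """Build a toy [C][H][W] latent filled with a constant value."""
    return [[[fill] * width for _ in range(height)] for _ in range(channels)]

def channel_concat(noisy, source):
    """U-Net style: stack the source latent onto the noisy latent along channels."""
    same_height = len(noisy[0]) == len(source[0])
    same_width = len(noisy[0][0]) == len(source[0][0])
    assert same_height and same_width, "spatial sizes must match"
    return noisy + source  # shape [C_noisy + C_source][H][W]

def patchify(latent, patch):
    """Split a [C][H][W] latent into flat tokens of length C * patch * patch."""
    channels, height, width = len(latent), len(latent[0]), len(latent[0][0])
    assert height % patch == 0 and width % patch == 0
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tokens.append([
                latent[c][top + dy][left + dx]
                for c in range(channels)
                for dy in range(patch)
                for dx in range(patch)
            ])
    return tokens  # shape [(H / patch) * (W / patch)][C * patch * patch]

def token_concat(noisy, source, patch):
    """DiT style: append source tokens to noisy tokens along the sequence axis."""
    return patchify(noisy, patch) + patchify(source, patch)

noisy = make_latent(4, 8, 8, 0.0)   # latent being denoised (hypothetical size)
source = make_latent(4, 8, 8, 1.0)  # latent of the pre-edit image

unet_input = channel_concat(noisy, source)
dit_tokens = token_concat(noisy, source, patch=2)

print(len(unet_input), len(unet_input[0]), len(unet_input[0][0]))  # 8 8 8
print(len(dit_tokens), len(dit_tokens[0]))                         # 32 16
```

Under these assumptions, channel concatenation doubles the input channels while keeping the spatial grid fixed, whereas token concatenation doubles the sequence length while keeping the token width fixed. The source image therefore reaches the network through a wider first layer in one case and through attention over a longer sequence in the other.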