CV | AI | Jun 16, 2025

Quantitative Comparison of Fine-Tuning Techniques for Pretrained Latent Diffusion Models in the Generation of Unseen SAR Images

Solène Debuysère, Nicolas Trouvé, Nathan Letheule, Olivier Lévêque, Elise Colin

arXiv:2506.13307v2 | h-index: 2 | Has Code | ISPRS Journal of Photogrammetry and Remote Sensing

Originality: Synthesis-oriented

AI Analysis

This work addresses the problem of generating realistic SAR images for data augmentation and scenario simulation in Earth observation, representing an incremental advancement by adapting existing methods to a specific domain.

The paper adapts a pretrained latent diffusion model to generate high-resolution Synthetic Aperture Radar (SAR) images, enabling controllable synthesis of rare or out-of-distribution scenes. Its results show that a hybrid fine-tuning strategy best preserves SAR geometry and texture while maintaining prompt fidelity.

We present a framework for adapting a large pretrained latent diffusion model to high-resolution Synthetic Aperture Radar (SAR) image generation. The approach enables controllable synthesis and the creation of rare or out-of-distribution scenes beyond the training set. Rather than training a task-specific small model from scratch, we adapt an open-source text-to-image foundation model to the SAR modality, using its semantic prior to align prompts with SAR imaging physics (side-looking geometry, slant-range projection, and coherent speckle with heavy-tailed statistics). Using a 100k-image SAR dataset, we compare full fine-tuning and parameter-efficient Low-Rank Adaptation (LoRA) across the UNet diffusion backbone, the Variational Autoencoder (VAE), and the text encoders. Evaluation combines (i) statistical distances to real SAR amplitude distributions, (ii) textural similarity via Gray-Level Co-occurrence Matrix (GLCM) descriptors, and (iii) semantic alignment using a SAR-specialized CLIP model. Our results show that a hybrid strategy (full UNet tuning with LoRA on the text encoders and a learned token embedding) best preserves SAR geometry and texture while maintaining prompt fidelity. The framework supports text-based control and multimodal conditioning (e.g., segmentation maps, TerraSAR-X, or optical guidance), opening new paths for large-scale SAR scene data augmentation and unseen scenario simulation in Earth observation.
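
To make the GLCM-based texture comparison in point (ii) more concrete, here is a minimal pure-Python sketch. It is not the authors' implementation: the abstract does not state the quantization scheme, pixel offsets, or which descriptors are used, so the min-max quantization, the single horizontal offset, the three descriptors (contrast, homogeneity, energy), and all function names below are illustrative assumptions.

```python
def quantize(image, levels):
    """Map amplitudes to gray levels 0..levels-1 (min-max scaling is an assumption)."""
    flat = [v for row in image for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return [[0 for _ in row] for row in image]
    scale = levels / (hi - lo)
    return [[min(int((v - lo) * scale), levels - 1) for v in row] for row in image]


def glcm(image, levels=8, dx=1, dy=0):
    """Normalized, symmetric co-occurrence matrix for one non-negative pixel offset.

    Assumes the patch is larger than the offset, so at least one pixel pair exists.
    """
    q = quantize(image, levels)
    h, w = len(q), len(q[0])
    counts = [[0.0] * levels for _ in range(levels)]
    for y in range(h - dy):
        for x in range(w - dx):
            a, b = q[y][x], q[y + dy][x + dx]
            counts[a][b] += 1.0
            counts[b][a] += 1.0  # count each pair in both directions
    total = sum(map(sum, counts))
    return [[c / total for c in row] for row in counts]


def glcm_descriptors(p):
    """Contrast, homogeneity, and energy of a normalized GLCM."""
    n = len(p)
    cells = [(i, j, p[i][j]) for i in range(n) for j in range(n)]
    return {
        "contrast": sum(v * (i - j) ** 2 for i, j, v in cells),
        "homogeneity": sum(v / (1 + (i - j) ** 2) for i, j, v in cells),
        "energy": sum(v * v for _, _, v in cells),
    }


def descriptor_gap(real, generated, levels=8):
    """Absolute per-descriptor difference between a real and a generated patch."""
    d_real = glcm_descriptors(glcm(real, levels))
    d_gen = glcm_descriptors(glcm(generated, levels))
    return {k: abs(d_real[k] - d_gen[k]) for k in d_real}
```

Calling `descriptor_gap(real_patch, generated_patch)` on two amplitude patches (lists of rows) returns one gap per descriptor, with smaller values indicating closer texture statistics. In practice a library routine such as scikit-image's `graycomatrix` would likely replace the explicit loops, and several offsets and angles would be averaged.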

