Physics-Guided VLM Priors for All-Cloud Removal
This work matters to remote sensing analysts and to applications that require clear optical imagery: it offers a unified, more accurate way to remove diverse cloud types, an incremental improvement over existing pipelines that handle thin and thick clouds separately.
This paper addresses the challenge of cloud removal in optical remote sensing, which is complicated by heterogeneous cloud types. The authors propose PhyVLM-CR, a method that integrates a Vision-Language Model (VLM) with a physical restoration model to achieve unified cloud removal, eliminating the need for explicit cloud-type decisions and improving quantitative accuracy compared to existing methods.
Cloud removal is a fundamental challenge in optical remote sensing because the degradation is heterogeneous: thin clouds distort radiometry via partial transmission, while thick clouds occlude the surface entirely. Existing pipelines separate thin-cloud correction from thick-cloud reconstruction, which requires explicit cloud-type decisions and often leads to error accumulation and discontinuities in mixed-cloud scenes. We therefore propose Physical-VLM All-Cloud Removal (PhyVLM-CR), a novel approach that integrates the semantic capability of a Vision-Language Model (VLM) into a physical restoration model to achieve high-fidelity, unified cloud removal. Specifically, the cognitive prior from a VLM (e.g., Qwen) is transformed into physical scattering parameters and a hallucination confidence map. Using this confidence map as a continuous soft gate, our method performs unified restoration via adaptive weighting: it prioritizes physical inversion in high-transmission regions to preserve radiometric fidelity, and transitions smoothly to temporal reference reconstruction in low-confidence occluded areas. This mechanism eliminates the need for explicit boundary delineation and yields coherent removal across heterogeneous cloud cover. Experiments on real-world Sentinel-2 surface reflectance imagery confirm that our approach balances cloud removal against content preservation, delivering hallucination-free results with substantially improved quantitative accuracy compared to existing methods.
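The soft-gating idea can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not give the exact formulation, so the sketch assumes the standard atmospheric scattering model I = J·t + A·(1 − t) for the physical inversion and a simple per-pixel convex blend for the adaptive weighting. The function names, the `t_min` clamp, and the array layouts are all hypothetical, and in the described method the transmission, airlight, and confidence inputs would be derived from the VLM prior rather than supplied by hand.

```python
import numpy as np


def physical_inversion(cloudy, transmission, airlight, t_min=0.05):
    """Invert the assumed scattering model I = J*t + A*(1 - t) for J.

    cloudy:       (H, W, C) observed reflectance
    transmission: (H, W) per-pixel transmission in (0, 1]
    airlight:     scalar or (C,) cloud/atmospheric light (assumed known)
    t_min:        hypothetical clamp that avoids dividing by near-zero t
    """
    t = np.clip(transmission, t_min, 1.0)[..., None]  # broadcast over bands
    return (cloudy - airlight * (1.0 - t)) / t


def soft_gated_restore(cloudy, reference, transmission, airlight, confidence):
    """Blend physical inversion and a temporal reference per pixel.

    confidence: (H, W) map in [0, 1]; values near 1 trust the physical
                inversion, values near 0 fall back to the reference image.
    """
    gate = np.clip(confidence, 0.0, 1.0)[..., None]
    physical = physical_inversion(cloudy, transmission, airlight)
    # Continuous gate: no hard thin/thick boundary is ever computed.
    return gate * physical + (1.0 - gate) * reference
```

Because the gate is continuous, pixels at the edge of a thick cloud receive a mixture of both estimates. Under these assumptions, that mixture is what removes the need for an explicit thin/thick cloud mask.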