AllClear: A Comprehensive Dataset and Benchmark for Cloud Removal in Satellite Imagery
AllClear addresses a critical data bottleneck for researchers and practitioners in remote sensing and should enable better cloud removal results. The contribution is incremental in the sense that it is a dataset and benchmark rather than a new method.
The paper tackles the lack of a comprehensive benchmark and a large training dataset for cloud removal in satellite imagery by introducing AllClear, the largest public dataset for the task, with 23,742 regions of interest and 4 million images. Training on 30x more data raises PSNR from 28.47 to 33.87.
Clouds in satellite imagery pose a significant challenge for downstream applications. A major challenge in current cloud removal research is the absence of a comprehensive benchmark and a sufficiently large and diverse training dataset. To address this problem, we introduce the largest public dataset -- $\textit{AllClear}$ for cloud removal, featuring 23,742 globally distributed regions of interest (ROIs) with diverse land-use patterns, comprising 4 million images in total. Each ROI includes complete temporal captures from the year 2022, with (1) multi-spectral optical imagery from Sentinel-2 and Landsat 8/9, (2) synthetic aperture radar (SAR) imagery from Sentinel-1, and (3) auxiliary remote sensing products such as cloud masks and land cover maps. We validate the effectiveness of our dataset by benchmarking performance, demonstrating the scaling law -- the PSNR rises from $28.47$ to $33.87$ with $30\times$ more data, and conducting ablation studies on the temporal length and the importance of individual modalities. This dataset aims to provide comprehensive coverage of the Earth's surface and promote better cloud removal results.
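To make the headline metric concrete, the sketch below computes PSNR from its standard definition, $10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$, and converts the reported gain (28.47 to 33.87) into an approximate reduction in mean squared error. It is an illustration, not the benchmark's evaluation code: the function names, the `max_val=1.0` pixel range, and the flat-sequence input format are all assumptions made here. The MSE conversion also assumes PSNR is derived from a single aggregate MSE, which may not match how the benchmark averages over images.

```python
import math


def psnr(pred, target, max_val=1.0):
    """PSNR in dB between two equal-length sequences of pixel values.

    max_val is the assumed peak pixel value (1.0 for images scaled to [0, 1]).
    """
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("inputs must be non-empty and the same length")
    # Mean squared error between prediction and cloud-free target.
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)


def mse_ratio_from_psnr_gain(psnr_before, psnr_after):
    """Factor by which MSE shrinks for a given PSNR gain at a fixed max_val."""
    return 10.0 ** ((psnr_after - psnr_before) / 10.0)


# Figures reported in the abstract; the ratio is only an approximation
# under the single-aggregate-MSE assumption stated above.
ratio = mse_ratio_from_psnr_gain(28.47, 33.87)
print(f"PSNR gain: {33.87 - 28.47:.2f} dB, MSE reduced by about {ratio:.2f}x")
```

Under these assumptions, the 5.40 dB gain corresponds to roughly a 3.5x lower mean squared error, which gives a more intuitive sense of how much the larger training set helps.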