CVOct 28, 2023Code
Foundation Models for Generalist Geospatial Artificial IntelligenceJohannes Jakubik, Sujit Roy, C. E. Phillips et al.
Significant progress in the development of highly adaptable and reusable Artificial Intelligence (AI) models is expected to have a significant impact on Earth science and remote sensing. Foundation models are pre-trained on large unlabeled datasets through self-supervision, and then fine-tuned for various downstream tasks with small labeled datasets. This paper introduces a first-of-a-kind framework for the efficient pre-training and fine-tuning of foundational models on extensive geospatial data. We have utilized this framework to create Prithvi, a transformer-based geospatial foundational model pre-trained on more than 1TB of multispectral satellite imagery from the Harmonized Landsat-Sentinel 2 (HLS) dataset. Our study demonstrates the efficacy of our framework in successfully fine-tuning Prithvi to a range of Earth observation tasks that have not been tackled by previous work on foundation models involving multi-temporal cloud gap imputation, flood mapping, wildfire scar segmentation, and multi-temporal crop segmentation. Our experiments show that the pre-trained model accelerates the fine-tuning process compared to leveraging randomly initialized weights. In addition, pre-trained Prithvi compares well against the state-of-the-art, e.g., outperforming a conditional GAN model in multi-temporal cloud imputation by up to 5pp (or 5.7%) in the structural similarity index. Finally, due to the limited availability of labeled data in the field of Earth observation, we gradually reduce the quantity of available labeled data for refining the model to evaluate data efficiency and demonstrate that data can be decreased significantly without affecting the model's accuracy. The pre-trained 100 million parameter model and corresponding fine-tuning workflows have been released publicly as open source contributions to the global Earth sciences community through Hugging Face.
LGJun 6, 2023
GEO-Bench: Toward Foundation Models for Earth MonitoringAlexandre Lacoste, Nils Lehmann, Pau Rodriguez et al.
Recent progress in self-supervision has shown that pre-training large neural networks on vast amounts of unsupervised data can lead to substantial increases in generalization to downstream tasks. Such models, recently coined foundation models, have been transformational to the field of natural language processing. Variants have also been proposed for image data, but their applicability to remote sensing tasks is limited. To stimulate the development of foundation models for Earth monitoring, we propose a benchmark comprised of six classification and six segmentation tasks, which were carefully curated and adapted to be both relevant to the field and well-suited for model evaluation. We accompany this benchmark with a robust methodology for evaluating models and reporting aggregated results to enable a reliable assessment of progress. Finally, we report results for 20 baselines to gain information about the performance of existing models. We believe that this benchmark will be a driver of progress across a variety of Earth monitoring tasks.
CVAug 12, 2024Code
Generalization Enhancement Strategies to Enable Cross-year Cropland Mapping with Convolutional Neural Networks Trained Using Historical SamplesSam Khallaghi, Rahebe Abedi, Hanan Abou Ali et al.
The accuracy of mapping agricultural fields across large areas is steadily improving with high-resolution satellite imagery and deep learning (DL) models, even in regions where fields are small and geometrically irregular. However, developing effective DL models often requires large, expensive label datasets, typically available only for specific years or locations. This limits the ability to create annual maps essential for agricultural monitoring, as domain shifts occur between years and regions due to changes in farming practices and environmental conditions. The challenge is to design a model flexible enough to account for these shifts without needing yearly labels. While domain adaptation techniques or semi-supervised training are common solutions, we explored enhancing the model's generalization power. Our results indicate that a holistic approach is essential, combining methods to improve generalization. Specifically, using an area-based loss function, such as Tversky-focal loss (TFL), significantly improved predictions across multiple years. The use of different augmentation techniques helped to encode different types of invariance, particularly photometric augmentations encoded invariance to brightness changes, though they increased false positives. The combination of photometric augmentation, TFL loss, and MC-dropout produced the best results, although dropout alone led to more false negatives in subsequent year predictions. Additionally, the choice of input normalization had a significant impact, with the best results obtained when statistics were calculated either locally or across the entire dataset over all bands (lab and gab). We developed a workflow that enabled a U-Net model to generate effective multi-year crop maps over large areas. Our code, available at: https://github.com/agroimpacts/cnn-generalization-enhancement, will be regularly updated with improvements.
63.3CVMay 12
No One Knows the State of the Art in Geospatial Foundation ModelsIsaac Corley, Nils Lehmann, Caleb Robinson et al.
Geospatial foundation models (GFMs) have been proposed as generalizable backbones for disaster response, land-cover mapping, food-security monitoring, and other high-stakes Earth-observation tasks. Yet the published work about these models does not give reviewers or users enough information to tell which model fits a given task. We argue that nobody knows what the current state of the art is in geospatial foundation models. The methods may be useful, but the GFM literature does not standardize evaluations, training and testing protocols, released weights, or pretraining controls well enough for anyone to compare or rank them. In a 152-paper audit, we find 46 cross-paper disagreements of at least 10 points for the same model, benchmark, and protocol; 94/126 papers with extractable pretraining data use a configuration no other paper uses; and 39% of GFM papers release no model weights. This lack of community standards can be solved. We propose six concrete expectations: named-license weight release, shared core evaluations, copied-versus-rerun baseline annotations, variance reporting, one shared evaluation harness, and data-vs-architecture-vs-algorithm controls. These gaps are a coordination failure, not a fault of any individual lab; the authors of this paper, like many others in the GFM community, have contributed to them. Rather than just critiquing the community, we aim to provide concrete steps toward a shared understanding of how to innovate GFMs.
CVDec 3, 2024
Prithvi-EO-2.0: A Versatile Multi-Temporal Foundation Model for Earth Observation ApplicationsDaniela Szwarcman, Sujit Roy, Paolo Fraccaro et al.
This technical report presents Prithvi-EO-2.0, a new geospatial foundation model that offers significant improvements over its predecessor, Prithvi-EO-1.0. Trained on 4.2M global time series samples from NASA's Harmonized Landsat and Sentinel-2 data archive at 30m resolution, the new 300M and 600M parameter models incorporate temporal and location embeddings for enhanced performance across various geospatial tasks. Through extensive benchmarking with GEO-Bench, the 600M version outperforms the previous Prithvi-EO model by 8\% across a range of tasks. It also outperforms six other geospatial foundation models when benchmarked on remote sensing tasks from different domains and resolutions (i.e. from 0.1m to 15m). The results demonstrate the versatility of the model in both classical earth observation and high-resolution applications. Early involvement of end-users and subject matter experts (SMEs) are among the key factors that contributed to the project's success. In particular, SME involvement allowed for constant feedback on model and dataset design, as well as successful customization for diverse SME-led applications in disaster response, land use and crop mapping, and ecosystem dynamics monitoring. Prithvi-EO-2.0 is available on Hugging Face and IBM terratorch, with additional resources on GitHub. The project exemplifies the Trusted Open Science approach embraced by all involved organizations.
CVApr 30, 2024
Seeing Through the Clouds: Cloud Gap Imputation with Prithvi Foundation ModelDenys Godwin, Hanxi Li, Michael Cecil et al.
Filling cloudy pixels in multispectral satellite imagery is essential for accurate data analysis and downstream applications, especially for tasks which require time series data. To address this issue, we compare the performance of a foundational Vision Transformer (ViT) model with a baseline Conditional Generative Adversarial Network (CGAN) model for missing value imputation in time series of multispectral satellite imagery. We randomly mask time series of satellite images using real-world cloud masks and train each model to reconstruct the missing pixels. The ViT model is fine-tuned from a pretrained model, while the CGAN is trained from scratch. Using quantitative evaluation metrics such as structural similarity index and mean absolute error as well as qualitative visual analysis, we assess imputation accuracy and contextual preservation.
CVDec 1, 2024
Local vs. Global: Local Land-Use and Land-Cover Models Deliver Higher Quality MapsGirmaw Abebe Tadesse, Caleb Robinson, Charles Mwangi et al.
In 2023, 58.0% of the African population experienced moderate to severe food insecurity, with 21.6% facing severe food insecurity. Land-use and land-cover maps provide crucial insights for addressing food insecurity by improving agricultural efforts, including mapping and monitoring crop types and estimating yield. The development of global land-cover maps has been facilitated by the increasing availability of earth observation data and advancements in geospatial machine learning. However, these global maps exhibit lower accuracy and inconsistencies in Africa, partly due to the lack of representative training data. To address this issue, we propose a data-centric framework with a teacher-student model setup, which uses diverse data sources of satellite images and label examples to produce local land-cover maps. Our method trains a high-resolution teacher model on images with a resolution of 0.331 m/pixel and a low-resolution student model on publicly available images with a resolution of 10 m/pixel. The student model also utilizes the teacher model's output as its weak label examples through knowledge transfer. We evaluated our framework using Murang'a county in Kenya, renowned for its agricultural productivity, as a use case. Our local models achieved higher quality maps, with improvements of 0.14 in the F1 score and 0.21 in Intersection-over-Union, compared to the best global model. Our evaluation also revealed inconsistencies in existing global maps, with a maximum agreement rate of 0.30 among themselves. Our work provides valuable guidance to decision-makers for driving informed decisions to enhance food security.
CVNov 19, 2025
GEO-Bench-2: From Performance to Capability, Rethinking Evaluation in Geospatial AINaomi Simumba, Nils Lehmann, Paolo Fraccaro et al.
Geospatial Foundation Models (GeoFMs) are transforming Earth Observation (EO), but evaluation lacks standardized protocols. GEO-Bench-2 addresses this with a comprehensive framework spanning classification, segmentation, regression, object detection, and instance segmentation across 19 permissively-licensed datasets. We introduce ''capability'' groups to rank models on datasets that share common characteristics (e.g., resolution, bands, temporality). This enables users to identify which models excel in each capability and determine which areas need improvement in future work. To support both fair comparison and methodological innovation, we define a prescriptive yet flexible evaluation protocol. This not only ensures consistency in benchmarking but also facilitates research into model adaptation strategies, a key and open challenge in advancing GeoFMs for downstream tasks. Our experiments show that no single model dominates across all tasks, confirming the specificity of the choices made during architecture design and pretraining. While models pretrained on natural images (ConvNext ImageNet, DINO V3) excel on high-resolution tasks, EO-specific models (TerraMind, Prithvi, and Clay) outperform them on multispectral applications such as agriculture and disaster response. These findings demonstrate that optimal model choice depends on task requirements, data modalities, and constraints. This shows that the goal of a single GeoFM model that performs well across all tasks remains open for future research. GEO-Bench-2 enables informed, reproducible GeoFM evaluation tailored to specific use cases. Code, data, and leaderboard for GEO-Bench-2 are publicly released under a permissive license.
CVNov 27, 2025
Leveraging AI multimodal geospatial foundation models for improved near-real-time flood mapping at a global scaleMirela G. Tulbure, Julio Caineta, Mark Broich et al.
Floods are among the most damaging weather-related hazards, and in 2024, the warmest year on record, extreme flood events affected communities across five continents. Earth observation (EO) satellites provide critical, frequent coverage for mapping inundation, yet operational accuracy depends heavily on labeled datasets and model generalization. Recent Geospatial Foundation Models (GFMs), such as ESA-IBM's TerraMind, offer improved generalizability through large-scale self-supervised pretraining, but their performance on diverse global flood events remains poorly understood. We fine-tune TerraMind for flood extent mapping using FloodsNet, a harmonized multimodal dataset containing co-located Sentinel-1 (Synthetic Aperture Radar, SAR data) and Sentinel-2 (optical) imagery for 85 flood events worldwide. We tested four configurations (base vs. large models; frozen vs. unfrozen backbones) and compared against the TerraMind Sen1Floods11 example and a U-Net trained on both FloodsNet and Sen1Floods11. The base-unfrozen configuration provided the best balance of accuracy, precision, and recall at substantially lower computational cost than the large model. The large unfrozen model achieved the highest recall. Models trained on FloodsNet outperformed the Sen1Floods11-trained example in recall with similar overall accuracy. U-Net achieved higher recall than all GFM configurations, though with slightly lower accuracy and precision. Our results demonstrate that integrating multimodal optical and SAR data and fine-tuning a GFM can enhance near-real-time flood mapping. This study provides one of the first global-scale evaluations of a GFM for flood segmentation, highlighting both its potential and current limitations for climate adaptation and disaster resilience.
LGDec 1, 2021
Toward Foundation Models for Earth Monitoring: Proposal for a Climate Change BenchmarkAlexandre Lacoste, Evan David Sherwin, Hannah Kerner et al.
Recent progress in self-supervision shows that pre-training large neural networks on vast amounts of unsupervised data can lead to impressive increases in generalisation for downstream tasks. Such models, recently coined as foundation models, have been transformational to the field of natural language processing. While similar models have also been trained on large corpuses of images, they are not well suited for remote sensing data. To stimulate the development of foundation models for Earth monitoring, we propose to develop a new benchmark comprised of a variety of downstream tasks related to climate change. We believe that this can lead to substantial improvements in many existing applications and facilitate the development of new applications. This proposal is also a call for collaboration with the aim of developing a better evaluation process to mitigate potential downsides of foundation models for Earth monitoring.
CVDec 5, 2020
LandCoverNet: A global benchmark land cover classification training datasetHamed Alemohammad, Kevin Booth
Regularly updated and accurate land cover maps are essential for monitoring 14 of the 17 Sustainable Development Goals. Multispectral satellite imagery provide high-quality and valuable information at global scale that can be used to develop land cover classification models. However, such a global application requires a geographically diverse training dataset. Here, we present LandCoverNet, a global training dataset for land cover classification based on Sentinel-2 observations at 10m spatial resolution. Land cover class labels are defined based on annual time-series of Sentinel-2, and verified by consensus among three human annotators.
CVDec 5, 2020
Generating Synthetic Multispectral Satellite Imagery from Sentinel-2Tharun Mohandoss, Aditya Kulkarni, Daniel Northrup et al.
Multi-spectral satellite imagery provides valuable data at global scale for many environmental and socio-economic applications. Building supervised machine learning models based on these imagery, however, may require ground reference labels which are not available at global scale. Here, we propose a generative model to produce multi-resolution multi-spectral imagery based on Sentinel-2 data. The resulting synthetic images are indistinguishable from real ones by humans. This technique paves the road for future work to generate labeled synthetic imagery that can be used for data augmentation in data scarce regions and applications.
CVDec 5, 2020
Semantic Segmentation of Medium-Resolution Satellite Imagery using Conditional Generative Adversarial NetworksAditya Kulkarni, Tharun Mohandoss, Daniel Northrup et al.
Semantic segmentation of satellite imagery is a common approach to identify patterns and detect changes around the planet. Most of the state-of-the-art semantic segmentation models are trained in a fully supervised way using Convolutional Neural Network (CNN). The generalization property of CNN is poor for satellite imagery because the data can be very diverse in terms of landscape types, image resolutions, and scarcity of labels for different geographies and seasons. Hence, the performance of CNN doesn't translate well to images from unseen regions or seasons. Inspired by Conditional Generative Adversarial Networks (CGAN) based approach of image-to-image translation for high-resolution satellite imagery, we propose a CGAN framework for land cover classification using medium-resolution Sentinel-2 imagery. We find that the CGAN model outperforms the CNN model of similar complexity by a significant margin on an unseen imbalanced test dataset.
CVApr 23, 2020
Proceedings of the ICLR Workshop on Computer Vision for Agriculture (CV4A) 2020Yannis Kalantidis, Laura Sevilla-Lara, Ernest Mwebaze et al.
This is the proceedings of the Computer Vision for Agriculture (CV4A) Workshop that was held in conjunction with the International Conference on Learning Representations (ICLR) 2020. The Computer Vision for Agriculture (CV4A) 2020 workshop was scheduled to be held in Addis Ababa, Ethiopia, on April 26th, 2020. It was held virtually that same day due to the COVID-19 pandemic. The workshop was held in conjunction with the International Conference on Learning Representations (ICLR) 2020.
CVNov 14, 2018
Generating a Training Dataset for Land Cover Classification to Advance Global DevelopmentYoni Nachmany, Hamed Alemohammad
Semantic segmentation of land cover classes is fundamental for agricultural and economic development work, from sustainable forestry to urban planning, yet existing training datasets have significant limitations. To generate an open and comprehensive training library of high resolution Earth imagery and high quality land cover classifications, public Sentinel-2 data at 10 m spatial resolution was matched with accurate GlobeLand30 labels from 2010, which were filtered by agreement with an intermediary Sentinel-2 classification at 20 m produced during atmospheric correction. Scene-level classifications were predicted by Random Forests trained on valid reflectance data and the filtered labels, and achieved over 80% model accuracy for a variety of locations. Further work is required to aggregate individual scene classifications for annual labels and to test the approach in more locations, before crowdsourcing human validation. The goal is to create a sustained community-wide effort to generate image labels not only for land cover, but also very specific images for major agriculture crops across the world and other thematic categories of interest to the global development community.