YieldSAT: A Multimodal Benchmark Dataset for High-Resolution Crop Yield Prediction
This dataset addresses the shortage of training data for scalable crop yield prediction models, which benefits agricultural researchers and practitioners, although the contribution is incremental in that it builds on existing data collection efforts.
The authors address the scarcity of high-quality crop yield prediction datasets by releasing YieldSAT, a large multimodal dataset spanning multiple countries and crop types, with over 12.2 million yield samples at 10 m resolution. They demonstrate its use for pixel regression and address distribution shifts with a domain-informed Deep Ensemble approach, for which they report significant performance gains.
Crop yield prediction requires substantial data to train scalable models. However, creating yield prediction datasets is constrained by high acquisition costs, heterogeneous data quality, and data privacy regulations. Consequently, existing datasets are scarce, low in quality, or limited to a single region or crop type, which hinders the development of scalable data-driven solutions. In this work, we release YieldSAT, a large, high-quality, multimodal dataset for high-resolution crop yield prediction. YieldSAT spans various climate zones across multiple countries (Argentina, Brazil, Uruguay, and Germany) and covers major crop types (corn, rapeseed, soybeans, and wheat) across 2,173 expert-curated fields. In total, over 12.2 million yield samples are available, each with a spatial resolution of 10 m. Each field is paired with multispectral satellite imagery, resulting in 113,555 labeled satellite images, complemented by auxiliary environmental data. We demonstrate the potential of large-scale, high-resolution crop yield prediction as a pixel regression task by comparing various deep learning models and data fusion architectures. Furthermore, we highlight open challenges arising from severe distribution shifts in the ground truth data under real-world conditions. To mitigate these shifts, we explore a domain-informed Deep Ensemble approach that exhibits significant performance gains. The dataset is available at https://yieldsat.github.io/.
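To make the pixel regression and Deep Ensemble framing concrete, the following is a minimal sketch, not the authors' implementation. It assumes each ensemble member outputs a per-pixel Gaussian (a mean and a variance), combines members with the standard Deep Ensemble mixture formulas, and scores predictions only on pixels that carry a yield label. The function names `masked_rmse` and `combine_ensemble`, the array shapes, and the toy values are all hypothetical; the abstract does not specify how the ensemble is made domain-informed, so that step is not shown.

```python
import numpy as np

def masked_rmse(pred, target, valid):
    """RMSE over labeled pixels only (valid is a boolean mask of shape (H, W))."""
    diff = (pred - target)[valid]
    return float(np.sqrt(np.mean(diff ** 2)))

def combine_ensemble(means, variances):
    """Combine M per-pixel Gaussian predictions, each array of shape (M, H, W).

    Treats the ensemble as a uniform mixture of Gaussians:
      mean = average of the member means
      var  = average of (member variance + member mean^2) - mean^2
    """
    mean = means.mean(axis=0)
    var = (variances + means ** 2).mean(axis=0) - mean ** 2
    return mean, var

# Toy 2x2 yield map (hypothetical values); one pixel has no label.
target = np.array([[4.0, 6.0], [0.0, 8.0]])
valid = np.array([[True, True], [False, True]])

# Two hypothetical members that err in opposite directions.
means = np.stack([target + 1.0, target - 1.0])
variances = np.full_like(means, 0.5)

mean, var = combine_ensemble(means, variances)
single_rmse = masked_rmse(means[0], target, valid)
ensemble_rmse = masked_rmse(mean, target, valid)
```

In this toy case the two members' errors cancel, so the ensemble mean matches the target on labeled pixels, while the combined variance grows beyond each member's own variance because the members disagree. Real members would be trained networks, and their errors would not cancel this cleanly.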