Joaquin Gajardo, Michele Volpi, Daniel Onwude et al.
Cropland maps are essential for remote sensing-based agricultural monitoring, providing timely insights without extensive field surveys. Machine learning enables large-scale mapping but depends on geo-referenced ground-truth data, which is costly to collect, motivating the use of global datasets in data-scarce regions. A key challenge is understanding how the quantity, quality, and proximity of the training data to the target region influences model performance. We evaluate this in Nigeria, using 1,827 manually labelled samples covering the whole country, and subsets of the Geowiki dataset: Nigeria-only, regional (Nigeria and neighbouring countries), and global. We extract pixel-wise multi-source time series arrays from Sentinel-1, Sentinel-2, ERA5 climate, and a digital elevation model using Google Earth Engine, comparing Random Forests with LSTMs, including a lightweight multi-headed LSTM variant. Results show local data significantly boosts performance, with accuracy gains up to 0.246 (RF) and 0.178 (LSTM). Nigeria-only or regional data outperformed global data despite the lower amount of labels, with the exception of the multi-headed LSTM, which benefited from global data when local samples were absent. Sentinel-1, climate, and topographic data are critical data sources, with their removal reducing F1-score by up to 0.593. Addressing class imbalance also improved LSTM accuracy by up to 0.071. Our top-performing model (Nigeria-only LSTM) achieved an F1-score of 0.814 and accuracy of 0.842, matching the best global land cover product while offering stronger recall, critical for food security. We release code, data, maps, and an interactive web app to support future work.