SynDelay: A Synthetic Dataset for Delivery Delay Prediction
The work addresses the shortage of open datasets for supply chain AI researchers, though its contribution is incremental: it focuses on dataset creation rather than novel predictive methods.
The paper tackles the scarcity of high-quality datasets for delivery delay prediction in supply chain management by introducing SynDelay, a synthetic dataset produced by a generative model trained on real-world data. The dataset serves as a challenging testbed for predictive modeling and is accompanied by baseline results and evaluation metrics.
Artificial intelligence (AI) is transforming supply chain management, yet progress in predictive tasks -- such as delivery delay prediction -- remains constrained by the scarcity of high-quality, openly available datasets. Existing datasets are often proprietary, small, or inconsistently maintained, hindering reproducibility and benchmarking. We present SynDelay, a synthetic dataset designed for delivery delay prediction. Generated using an advanced generative model trained on real-world data, SynDelay preserves realistic delivery patterns while ensuring privacy. Although not entirely free of noise or inconsistencies, it provides a challenging and practical testbed for advancing predictive modelling. To support adoption, we provide baseline results and evaluation metrics as initial benchmarks, serving as reference points rather than state-of-the-art claims. SynDelay is publicly available through the Supply Chain Data Hub, an open initiative promoting dataset sharing and benchmarking in supply chain AI. We encourage the community to contribute datasets, models, and evaluation practices to advance research in this area. All code is openly accessible at https://supplychaindatahub.org.
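To illustrate what a reference-point baseline with evaluation metrics could look like, here is a minimal sketch in plain Python. It assumes delay prediction is framed as binary classification (1 = delayed, 0 = on time), which the abstract does not state. The labels are hypothetical toy values, not SynDelay data, and the helper names `binary_metrics` and `majority_baseline` are illustrative rather than part of any released code.

```python
from collections import Counter


def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for binary labels (1 = delayed)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    total = tp + fp + fn + tn

    # Guard every ratio against an empty denominator.
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def majority_baseline(y_train, n_predictions):
    """Predict the most frequent training label for every test example."""
    majority_label = Counter(y_train).most_common(1)[0][0]
    return [majority_label] * n_predictions


# Hypothetical toy labels, used only to exercise the functions.
y_train = [0, 0, 1, 0, 1, 0]
y_test = [0, 1, 0, 1]

y_pred = majority_baseline(y_train, len(y_test))
scores = binary_metrics(y_test, y_pred)
print(scores)
```

A majority-class predictor like this can reach moderate accuracy while scoring zero recall on delayed orders, which is one reason to report precision, recall, and F1 alongside accuracy when the classes are imbalanced.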