QuitoBench: A High-Quality Open Time Series Forecasting Benchmark
QuitoBench gives time series forecasting researchers a new benchmark and addresses a fundamental gap in the field, though the contribution is incremental in that it builds on existing data and methods.
The paper tackles the scarcity of high-quality benchmarks in time series forecasting by introducing QuitoBench, a regime-balanced benchmark built on a billion-scale corpus from Alipay. Its key findings include a context-length crossover, where deep learning models lead at short context but foundation models dominate at long context, and a 3.64× MAE gap across regimes driven by forecastability.
Time series forecasting is critical across finance, healthcare, and cloud computing, yet progress is constrained by a fundamental bottleneck: the scarcity of large-scale, high-quality benchmarks. To address this gap, we introduce \textsc{QuitoBench}, a regime-balanced benchmark for time series forecasting with coverage across eight trend$\times$seasonality$\times$forecastability (TSF) regimes, designed to capture forecasting-relevant properties rather than application-defined domain labels. The benchmark is built upon \textsc{Quito}, a billion-scale time series corpus of application traffic from Alipay spanning nine business domains. Benchmarking 10 models spanning deep learning models, foundation models, and statistical baselines across 232,200 evaluation instances, we report four key findings: (i) a context-length crossover where deep learning models lead at short context ($L=96$) but foundation models dominate at long context ($L \ge 576$); (ii) forecastability is the dominant difficulty driver, producing a $3.64 \times$ MAE gap across regimes; (iii) deep learning models match or surpass foundation models at $59 \times$ fewer parameters; and (iv) scaling the amount of training data provides substantially greater benefit than scaling model size for both model families. These findings are validated by strong cross-benchmark and cross-metric consistency. Our open-source release enables reproducible, regime-aware evaluation for time series forecasting research.
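To make the regime-aware evaluation concrete, the following is a minimal sketch of how a per-regime MAE and a cross-regime MAE gap could be computed. It is not the authors' code. It assumes that each of the three TSF axes is binarized into "low" and "high" (which would give the eight regimes the abstract mentions) and that the gap is the ratio of the largest to the smallest per-regime MAE. The abstract confirms neither assumption, and the names `REGIMES`, `regime_mae`, and `mae_gap` are hypothetical.

```python
from collections import defaultdict
from itertools import product

# Assumption: each axis (trend, seasonality, forecastability) is split into
# two levels, so the regimes are the 2 x 2 x 2 = 8 combinations.
LEVELS = ("low", "high")
REGIMES = list(product(LEVELS, repeat=3))


def mae(y_true, y_pred):
    """Mean absolute error between two equal-length, non-empty sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)


def regime_mae(instances):
    """Average the per-instance MAE within each regime.

    `instances` is an iterable of (regime, y_true, y_pred) tuples, where
    `regime` is a (trend, seasonality, forecastability) label.
    """
    scores = defaultdict(list)
    for regime, y_true, y_pred in instances:
        scores[regime].append(mae(y_true, y_pred))
    return {regime: sum(v) / len(v) for regime, v in scores.items()}


def mae_gap(per_regime):
    """Ratio of the hardest regime's MAE to the easiest regime's MAE.

    Assumption: this is one plausible reading of an "MAE gap across regimes".
    """
    return max(per_regime.values()) / min(per_regime.values())


# Toy data, invented purely for illustration (not results from the paper).
toy_instances = [
    (("high", "high", "high"), [1.0, 2.0, 3.0], [1.5, 2.5, 3.5]),
    (("low", "low", "low"), [1.0, 2.0, 3.0], [3.0, 4.0, 5.0]),
]
per_regime = regime_mae(toy_instances)
gap = mae_gap(per_regime)
```

Reporting errors per regime, rather than pooling all instances, keeps an over-represented easy regime from masking poor performance on a hard one, which is the motivation the abstract gives for a regime-balanced design.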