Data-efficient flood depth prediction through domain-aware coreset selection and tabular foundation models
This work addresses the need for data-efficient and transferable flood prediction models for hydrologists and emergency managers, enabling rapid deployment across watersheds with minimal local data.
The paper proposes a domain-aware coreset selection pipeline that conditions a tabular foundation model for flood depth prediction. Using only 0.7% of the per-watershed training pool, the model achieves a mean R² of 0.663 across nine watersheds, which is 98.5% of the supervised reference (R²=0.673). The model transfers to held-out watersheds without retraining.
Near-real-time flood depth prediction demands surrogate models that are accurate, fast, and transferable across watersheds. Supervised surrogates can match physics-based simulators in accuracy but need millions of training rows per watershed and cannot extrapolate beyond their original mesh. We propose a domain-aware coreset construction pipeline that conditions a tabular foundation model at inference time. The pipeline stratifies storms by return period and most-affected watershed, then samples hexagons with a target-aware spatial selector. With 0.7% of the per-watershed training pool, the model attains a mean $R^2$ of 0.663 across nine Houston-area watersheds, which is 98.5% of the supervised reference ($R^2$ = 0.673). It transfers to held-out watersheds without task-specific retraining and stays ahead of a coreset-trained supervised baseline there. On real storms, it exceeds the supervised reference on a far out-of-distribution case and trails it on a mostly in-distribution one. Domain-aware coreset construction thus lets tabular foundation models deliver data-efficient, watershed-transferable flood predictions without per-watershed training.
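The two-stage selection described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the record fields (`return_period`, `most_affected_watershed`, `depth`), the function names, and all parameters are hypothetical. In particular, the depth-balanced sampler is only a stand-in for the target-aware spatial selector, whose actual rule the abstract does not specify.

```python
import random
from collections import defaultdict


def stratified_storm_coreset(storms, per_stratum, seed=0):
    """Pick up to `per_stratum` storms from each
    (return_period, most_affected_watershed) stratum.

    Field names are assumptions made for this illustration.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for storm in storms:
        key = (storm["return_period"], storm["most_affected_watershed"])
        strata[key].append(storm)

    selected = []
    for key in sorted(strata):  # sorted for a reproducible order
        members = strata[key]
        k = min(per_stratum, len(members))
        selected.extend(rng.sample(members, k))
    return selected


def depth_balanced_hexagons(hexagons, n_bins, per_bin, seed=0):
    """Stand-in for a target-aware selector (an assumption, not the
    paper's method): rank hexagons by flood depth, split the ranking
    into `n_bins` contiguous groups, and sample evenly from each group
    so that rare deep cells are not swamped by dry ones.
    """
    rng = random.Random(seed)
    ranked = sorted(hexagons, key=lambda h: h["depth"])
    n = len(ranked)

    selected = []
    for b in range(n_bins):
        group = ranked[b * n // n_bins:(b + 1) * n // n_bins]
        k = min(per_bin, len(group))
        selected.extend(rng.sample(group, k))
    return selected
```

In this sketch, the selected storms and hexagons would together form the small context set supplied to the tabular foundation model at inference time. Sampling per stratum rather than uniformly is what keeps rare return periods and less frequently affected watersheds represented in a coreset this small.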