AI CE DC LG | Jul 30, 2025

Data Readiness for Scientific AI at Scale

Wesley Brewer, Patrick Widener, Valentine Anantharaj, Feiyi Wang, Tom Beck, Arjun Shankar, Sarp Oral

arXiv:2507.23018v15.81 citations | h-index: 4 | ICPP Workshops

Originality: Synthesis-oriented

AI Analysis

This paper addresses the data preprocessing bottlenecks faced by researchers applying AI in scientific domains. The contribution is incremental, since it builds on existing Data Readiness for AI principles.

The paper tackles the challenge of preparing large-scale scientific datasets for AI training by analyzing workflows across the climate, nuclear fusion, bio/health, and materials domains. It introduces a two-dimensional readiness framework, composed of Data Readiness Levels and Data Processing Stages, that is tailored to high-performance computing environments and intended to guide infrastructure development.

This paper examines how Data Readiness for AI (DRAI) principles apply to leadership-scale scientific datasets used to train foundation models. We analyze archetypal workflows across four representative domains - climate, nuclear fusion, bio/health, and materials - to identify common preprocessing patterns and domain-specific constraints. We introduce a two-dimensional readiness framework composed of Data Readiness Levels (raw to AI-ready) and Data Processing Stages (ingest to shard), both tailored to high-performance computing (HPC) environments. This framework outlines key challenges in transforming scientific data for scalable AI training, emphasizing transformer-based generative models. Together, these dimensions form a conceptual maturity matrix that characterizes scientific data readiness and guides infrastructure development toward standardized, cross-domain support for scalable and reproducible AI for science.
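
The maturity matrix described in the abstract can be pictured as a small data structure: one axis of ordered readiness levels, one axis of ordered processing stages, and a recorded level for each stage of a given dataset. The sketch below is illustrative only and is not taken from the paper. The abstract names just the endpoints of each axis ("raw" to "AI-ready", "ingest" to "shard"), so the intermediate entries are hypothetical placeholders, and the `ReadinessMatrix` class, its methods, and the "least-ready stage" rule in `overall` are assumptions made for this example.

```python
from dataclasses import dataclass, field

# Only the endpoints below come from the abstract. The middle entries are
# hypothetical placeholders, not the paper's actual level or stage names.
LEVELS = ["raw", "intermediate (placeholder)", "AI-ready"]
STAGES = ["ingest", "intermediate (placeholder)", "shard"]


@dataclass
class ReadinessMatrix:
    """Tracks, for one dataset, the readiness level reached at each stage."""

    levels: list = field(default_factory=lambda: list(LEVELS))
    stages: list = field(default_factory=lambda: list(STAGES))
    reached: dict = field(default_factory=dict)

    def record(self, stage: str, level: str) -> None:
        """Store the readiness level observed at a processing stage."""
        if stage not in self.stages:
            raise ValueError(f"unknown stage: {stage!r}")
        if level not in self.levels:
            raise ValueError(f"unknown level: {level!r}")
        self.reached[stage] = level

    def level_at(self, stage: str) -> str:
        """Return the recorded level, defaulting to the lowest level."""
        return self.reached.get(stage, self.levels[0])

    def overall(self) -> str:
        """Assumed rule: a dataset is only as ready as its least-ready stage."""
        lowest = min(self.levels.index(self.level_at(s)) for s in self.stages)
        return self.levels[lowest]


matrix = ReadinessMatrix()
matrix.record("ingest", "AI-ready")
matrix.record("shard", "raw")
print(matrix.overall())  # prints "raw": the shard stage is still at the lowest level
```

Taking the minimum across stages is one possible way to collapse the matrix into a single label. The framework itself is described as a conceptual matrix, so reporting the full per-stage picture may be closer to its intent.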

