DB · AI · Feb 18

DataJoint 2.0: A Computational Substrate for Agentic Scientific Workflows

arXiv:2602.16585v1 · h-index: 2
Originality: Highly original
AI Analysis

This work addresses the need for operational rigor in human-agent collaboration on scientific workflows, which it terms SciOps, by providing a formal system intended to prevent data corruption.

The paper tackles the problem of fragmented provenance and lack of transactional guarantees in scientific data pipelines by introducing DataJoint 2.0, a computational substrate based on a relational workflow model that unifies data structure, dependencies, and integrity constraints into a single queryable system.

Operational rigor determines whether human-agent collaboration succeeds or fails. Scientific data pipelines need the equivalent of DevOps -- SciOps -- yet common approaches fragment provenance across disconnected systems without transactional guarantees. DataJoint 2.0 addresses this gap through the relational workflow model: tables represent workflow steps, rows represent artifacts, foreign keys prescribe execution order. The schema specifies not only what data exists but how it is derived -- a single formal system where data structure, computational dependencies, and integrity constraints are all queryable, enforceable, and machine-readable. Four technical innovations extend this foundation: object-augmented schemas integrating relational metadata with scalable object storage, semantic matching using attribute lineage to prevent erroneous joins, an extensible type system for domain-specific formats, and distributed job coordination designed for composability with external orchestration. By unifying data structure, data, and computational transformations, DataJoint creates a substrate for SciOps where agents can participate in scientific workflows without risking data corruption.
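The core idea in the abstract (tables as workflow steps, rows as artifacts, foreign keys prescribing execution order) can be illustrated with a minimal sketch. This is not DataJoint's API: it uses Python's standard-library `sqlite3` and `graphlib` modules as a stand-in. The three-step pipeline (`session`, `recording`, `analysis`) and the helper names `build_pipeline`, `execution_order`, and `orphan_rejected` are hypothetical, chosen only for illustration.

```python
import sqlite3
from graphlib import TopologicalSorter

# Hypothetical pipeline: each table is a workflow step, each row an artifact.
# Table and column names are illustrative assumptions, not from the paper.
SCHEMA = """
PRAGMA foreign_keys = ON;
CREATE TABLE session (
    session_id INTEGER PRIMARY KEY
);
CREATE TABLE recording (
    recording_id INTEGER PRIMARY KEY,
    session_id INTEGER NOT NULL REFERENCES session(session_id)
);
CREATE TABLE analysis (
    analysis_id INTEGER PRIMARY KEY,
    recording_id INTEGER NOT NULL REFERENCES recording(recording_id)
);
"""


def build_pipeline():
    """Create an in-memory database holding the example schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def execution_order(conn):
    """Derive a valid step order by querying the schema's foreign keys."""
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table'"
        )
    ]
    graph = {}
    for table in tables:
        # Column 2 of each foreign_key_list row is the referenced (parent) table.
        fk_rows = conn.execute(f'PRAGMA foreign_key_list("{table}")')
        graph[table] = {row[2] for row in fk_rows}
    # Parents (dependencies) are emitted before the steps that depend on them.
    return list(TopologicalSorter(graph).static_order())


def orphan_rejected(conn):
    """Return True if a recording without a parent session is refused."""
    try:
        conn.execute(
            "INSERT INTO recording (recording_id, session_id) VALUES (1, 99)"
        )
    except sqlite3.IntegrityError:
        return True
    return False


if __name__ == "__main__":
    conn = build_pipeline()
    print(execution_order(conn))
    print(orphan_rejected(conn))
```

In this sketch the execution order is never written down separately; it is recovered by querying the schema itself, and the same foreign keys that define the order also reject an artifact whose upstream step has not produced its input. That is the sense in which structure, dependencies, and integrity constraints live in one queryable system.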

