Carl Kesselman

HC
h-index: 3
6 papers
15 citations
Novelty: 43%
AI Score: 39

6 Papers

46.1 HC Apr 7
Reproducibility Beyond Artifacts: Interactional Support for Collaborative Machine Learning

Zhiwei Li, Carl Kesselman

Machine learning (ML) reproducibility is often framed as a problem of incomplete artifact recording. This framing leads to systems that prioritize capturing datasets, code, configurations, and execution environments. However, in collaborative and interdisciplinary ML projects, reproducibility failures often arise not only from missing artifacts but from difficulties in interpreting prior work, aligning evolving components, and reconstructing experimental intent over time. Drawing on a 19-month deployment of a data-centric ML management system in a clinical research project, we identify recurring interactional breakdowns that persist despite comprehensive structural traceability. Based on these findings, we propose a two-layer socio-technical ML management system combining lifecycle-aware artifact infrastructure with an interactional layer designed to mediate coordination, explanation, and shared understanding. We discuss how an AI-mediated semantic interface reframes reproducibility as an ongoing socio-technical accomplishment rather than a static property of recorded traces, and outline implications for human-centered ML infrastructure design.

LG Jun 19, 2025
From Data to Decision: Data-Centric Infrastructure for Reproducible ML in Collaborative eScience

Zhiwei Li, Carl Kesselman, Tran Huy Nguyen et al.

Reproducibility remains a central challenge in machine learning (ML), especially in collaborative eScience projects where teams iterate over data, features, and models. Current ML workflows are often dynamic yet fragmented, relying on informal data sharing, ad hoc scripts, and loosely connected tools. This fragmentation impedes transparency, reproducibility, and the adaptability of experiments over time. This paper introduces a data-centric framework for lifecycle-aware reproducibility, centered around six structured artifacts: Dataset, Feature, Workflow, Execution, Asset, and Controlled Vocabulary. These artifacts formalize the relationships between data, code, and decisions, enabling ML experiments to be versioned, interpretable, and traceable over time. The approach is demonstrated through a clinical ML use case of glaucoma detection, illustrating how the system supports iterative exploration, improves reproducibility, and preserves the provenance of collaborative decisions across the ML lifecycle.

LG Jun 27, 2024
Deriva-ML: A Continuous FAIRness Approach to Reproducible Machine Learning Models

Zhiwei Li, Carl Kesselman, Mike D'Arch et al.

Increasingly, artificial intelligence (AI) and machine learning (ML) are used in eScience applications [9]. While these approaches have great potential, the literature has shown that ML-based approaches frequently suffer from results that are either incorrect or unreproducible due to mismanagement or misuse of data used for training and validating the models [12, 15]. Recognition of the necessity of high-quality data for correct ML results has led to data-centric ML approaches that shift the central focus from model development to creation of high-quality data sets to train and validate the models [14, 20]. However, there are limited tools and methods available for data-centric approaches to explore and evaluate ML solutions for eScience problems which often require collaborative multidisciplinary teams working with models and data that will rapidly evolve as an investigation unfolds [1]. In this paper, we show how data management tools based on the principle that all of the data for ML should be findable, accessible, interoperable and reusable (i.e. FAIR [26]) can significantly improve the quality of data that is used for ML applications. When combined with best practices that apply these tools to the entire life cycle of an ML-based eScience investigation, we can significantly improve the ability of an eScience team to create correct and reproducible ML solutions. We propose an architecture and implementation of such tools and demonstrate through two use cases how they can be used to improve ML-based eScience investigations.

SE Jan 20, 2022
CUF-Links: Continuous and Ubiquitous FAIRness Linkages for reproducible research

Ian Foster, Carl Kesselman

Despite much creative work on methods and tools, reproducibility -- the ability to repeat the computational steps used to obtain a research result -- remains elusive. One reason for these difficulties is that extant tools for capturing research processes do not align well with the rich working practices of scientists. We advocate here for simple mechanisms that can be integrated easily with current work practices to capture basic information about every data product consumed or produced in a project. We argue that by thus extending the scope of findable, accessible, interoperable, and reusable (FAIR) data in both time and space to enable the creation of a continuous chain of continuous and ubiquitous FAIRness linkages (CUF-Links) from inputs to outputs, such mechanisms can provide a strong foundation for documenting the provenance linkages that are essential to reproducible research. We give examples of mechanisms that can achieve these goals, and review how they have been applied in practice.

HC Oct 5, 2021
Model-Adaptive Interface Generation for Data-Driven Discovery

Hongsuda Tangmunarunkit, Aref Shafaeibejestan, Joshua Chudy et al.

Discovery of new knowledge is increasingly data-driven, predicated on a team's ability to collaboratively create, find, analyze, retrieve, and share pertinent datasets over the duration of an investigation. This is especially true in the domain of scientific discovery where generation, analysis, and interpretation of data are the fundamental mechanisms by which research teams collaborate to achieve their shared scientific goal. Data-driven discovery in general, and scientific discovery in particular, is distinguished by complex and diverse data models and formats that evolve over the lifetime of an investigation. While databases and related information systems have the potential to be valuable tools in the discovery process, developing effective interfaces for data-driven discovery remains a roadblock to the application of database technology as an essential tool in scientific investigations. In this paper, we present a model-adaptive approach to creating interaction environments for data-driven discovery of scientific data that automatically generates interactive user interfaces for editing, searching, and viewing scientific data based entirely on introspection of an extended relational data model. We have applied model-adaptive interface generation to many active scientific investigations spanning domains of proteomics, bioinformatics, neuroscience, occupational therapy, stem cells, genitourinary, craniofacial development, and others. We present the approach, its implementation, and its evaluation through analysis of its usage in diverse scientific settings.

DB Oct 19, 2016
ERMrest: an entity-relationship data storage service for web-based, data-oriented collaboration

Karl Czajkowski, Carl Kesselman, Robert Schuler et al.

Scientific discovery is increasingly dependent on a scientist's ability to acquire, curate, integrate, analyze, and share large and diverse collections of data. While the details vary from domain to domain, these data often consist of diverse digital assets (e.g. image files, sequence data, or simulation outputs) that are organized with complex relationships and context which may evolve over the course of an investigation. In addition, discovery is often collaborative, such that sharing of the data and its organizational context is highly desirable. Common systems for managing file or asset metadata hide their inherent relational structures, while traditional relational database systems do not extend to the distributed collaborative environment often seen in scientific investigations. To address these issues, we introduce ERMrest, a collaborative data management service which allows general entity-relationship modeling of metadata manipulated by RESTful access methods. We present the design criteria, architecture, and service implementation, as well as describe an ecosystem of tools and services that we have created to integrate metadata into an end-to-end scientific data life cycle. ERMrest has been deployed to hundreds of users across multiple scientific research communities and projects. We present two representative use cases: an international consortium and an early-phase, multidisciplinary research project.