IM Sep 22, 2021
Astronomical Pipeline Provenance: A Use Case Evaluation
Michael A. C. Johnson, Marcus Paradies, Marta Dembska et al.
In this decade, astronomy is undergoing a paradigm shift to handle data from next-generation observatories such as the Square Kilometre Array (SKA) or the Vera C. Rubin Observatory (LSST). Producing real-time data streams of up to 10 TB/s and data products of the order of 600 PB/year, the SKA will be the biggest civil data-producing machine in the world, demanding novel solutions for storing and analysing these data volumes. Because this real-time data processing relies on complex, automated pipelines, its provenance is key to establishing confidence in the system, its final data products, and ultimately its scientific results. The intention of this paper is to lay the foundation for an automated provenance generation tool for astronomical/data-processing pipelines. We therefore present a use case analysis specific to astronomical needs, which addresses the issues of trust and reproducibility as well as further use cases of interest to astronomers. This analysis is subsequently used as the basis to discuss the requirements, challenges, and opportunities involved in designing both the tool and the associated provenance model.
AI Jul 14, 2021
The I-ADOPT Interoperability Framework for FAIRer data descriptions of biodiversity
Barbara Magagna, Ilaria Rosati, Maria Stoica et al.
Biodiversity, the variation within and between species and ecosystems, is essential for human well-being and the equilibrium of the planet. It is critical for the sustainable development of human society and is an important global challenge. Biodiversity research has become increasingly data-intensive and deals with heterogeneous and distributed data made available by global and regional initiatives, such as GBIF, ILTER, LifeWatch, BODC, PANGAEA, and TERN, that apply different data management practices. In particular, a variety of metadata and semantic resources have been produced by these initiatives to describe biodiversity observations, introducing interoperability issues across data management systems. To address these challenges, the InteroperAble Descriptions of Observable Property Terminology WG (I-ADOPT WG) was formed by a group of international terminology providers and data center managers in 2019 with the aim of building a common approach to describe what is observed, measured, calculated, or derived. Based on an extensive analysis of existing semantic representations of variables, the WG has recently published the I-ADOPT framework ontology to facilitate interoperability between existing semantic resources and support the provision of machine-readable variable descriptions whose components are mapped to FAIR vocabulary terms. The I-ADOPT framework ontology defines a set of high-level semantic components that can be used to describe a variety of patterns commonly found in scientific observations. This contribution will focus on how the I-ADOPT framework can be applied to represent variables commonly used in the biodiversity domain.
IR Jun 16, 2019
ConTrOn: Continuously Trained Ontology based on Technical Data Sheets and Wikidata
Kobkaew Opasjumruskit, Diana Peters, Sirko Schindler
In engineering projects involving various parts from global suppliers, one common task is to determine which parts are best suited for the project requirements. Information about specific parts' characteristics is published in so-called data sheets. However, these data sheets are often published only in textual form, e.g., as a PDF. Hence, they have to be transformed into a machine-interpretable format. This transformation process still requires a lot of manual intervention and is prone to errors. Automated approaches make use of ontologies to capture the given domain and thus improve automated information extraction from the data sheets. However, ontologies rely solely on the experiences and perspectives of their creators at the time of creation and cannot accumulate knowledge over time on their own. This paper presents ConTrOn -- Continuously Trained Ontology -- a system that automatically augments ontologies. ConTrOn tackles terminology problems by combining the knowledge extracted from data sheets with an ontology created by domain experts and external knowledge bases such as WordNet and Wikidata. To demonstrate how the enriched ontology can improve the information extraction process, we selected data sheets from spacecraft development as a use case. The evaluation results show that the amount of information extracted from data sheets using ontologies increases significantly after the ontology enrichment.