DC May 22
An Ecosystem of Services for FAIR Computational Workflows
Sean R. Wilkinson, Johan Gustafsson, Finn Bacall et al.
Computational workflows represent major investments of effort and expertise. As first-class, publishable research objects in their own right, they are key to sharing methodological know-how for reuse, reproducibility, and transparency. Applying the FAIR Principles to workflows is therefore essential to making them Findable, Accessible, Interoperable, and Reusable. Making workflows FAIR reduces duplication of effort, assists in the reuse of best-practice approaches and community-supported standards, and ensures that workflows as digital objects can support reproducible, robust science. FAIR workflows draw from both FAIR data and FAIR software principles, and they help ensure and support data FAIRification. The FAIR Principles emphasize the association of persistent identifiers and machine-actionable metadata with workflows. Implementing the Principles requires a framework with appropriate programmatic protocols and an accompanying ecosystem of services, tools, policies, and best practices, as well as the buy-in of existing workflow systems. The European EOSC-Life Workflow Collaboratory is an example of such a digital infrastructure for the biosciences. It includes a metadata standards framework for describing workflows, which is managed and used by dedicated new FAIR workflow services and programmatic APIs for interoperability and metadata access. It also includes the WorkflowHub registry and the LifeMonitor workflow testing service, and it incorporates existing workflow systems and packaging solutions. Here, we introduce the FAIR Principles for workflows and connect FAIR workflows with the FAIR ecosystems they inhabit, using the EOSC-Life Collaboratory as a concrete example. We also introduce other community efforts that are making it easier for workflows to be shared and reused by others, and we discuss how variations across workflow settings affect their FAIR perspectives.
CL Sep 20, 2016
A framework for mining process models from emails logs
Diana Jlailaty, Daniela Grigori, Khalid Belhajjame
Due to its wide use in personal and, more importantly, professional contexts, email represents a valuable source of information that can be harvested for understanding, reengineering and repurposing undocumented business processes of companies and institutions. Towards this aim, a few researchers have investigated the problem of extracting process-oriented information from email logs in order to benefit from the many available process mining techniques and tools. In this paper we go further in this direction by proposing a new method for mining process models from email logs that leverages unsupervised machine learning techniques with little human involvement. Moreover, our method allows emails to be semi-automatically labeled with activity names, which can then be used for activity recognition in new incoming emails. A use case demonstrates the usefulness of the proposed solution on a modestly sized, yet real-world, dataset containing emails that belong to two different process models.
SE May 21, 2016
Automatic vs Manual Provenance Abstractions: Mind the Gap
Pinar Alper, Khalid Belhajjame, Carole A. Goble
In recent years the need to simplify provenance, or to hide sensitive information in it, has given rise to research on provenance abstraction. In the context of scientific workflows, existing research provides techniques to semi-automatically create abstractions of a given workflow description, which are in turn used as filters over the workflow's provenance traces. An alternative approach that is commonly adopted by scientists is to build workflows with abstractions embedded into the workflow's design, such as sub-workflows. This paper reports on a comparison of manual versus semi-automated approaches in a context where result abstractions are used to filter report-worthy results of computational scientific analyses. Specifically, we take a real-world workflow containing user-created design abstractions and compare these with abstractions created by the ZOOM UserViews and Workflow Summaries systems. Our comparison shows that semi-automatic and manual approaches largely overlap from a process perspective; however, there is a dramatic mismatch in terms of the data artefacts retained in an abstracted account of derivation. We discuss the reasons and suggest future research directions.
SE Feb 9, 2015
YesWorkflow: A User-Oriented, Language-Independent Tool for Recovering Workflow Information from Scripts
Timothy McPhillips, Tianhong Song, Tyler Kolisnik et al.
Scientific workflow management systems offer features for composing complex computational pipelines from modular building blocks, for executing the resulting automated workflows, and for recording the provenance of data products resulting from workflow runs. Despite the advantages such features provide, many automated workflows continue to be implemented and executed outside of scientific workflow systems due to the convenience and familiarity of scripting languages (such as Perl, Python, R, and MATLAB), and to the high productivity many scientists experience when using these languages. YesWorkflow is a set of software tools that aim to provide such users of scripting languages with many of the benefits of scientific workflow systems. YesWorkflow requires neither the use of a workflow engine nor the overhead of adapting code to run effectively in such a system. Instead, YesWorkflow enables scientists to annotate existing scripts with special comments that reveal the computational modules and dataflows otherwise implicit in these scripts. YesWorkflow tools extract and analyze these comments, represent the scripts in terms of entities based on the typical scientific workflow model, and provide graphical renderings of this workflow-like view of the scripts. Future versions of YesWorkflow also will allow the prospective provenance of the data products of these scripts to be queried in ways similar to those available to users of scientific workflow systems.
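The annotation idea described above can be illustrated with a minimal sketch. It assumes only the basic YesWorkflow comment keywords `@begin`, `@end`, `@in`, and `@out`; the sample script, the function names `extract_blocks` and `dataflow_edges`, and the parsing logic are hypothetical and are not the YesWorkflow implementation. Nested blocks and other keywords are deliberately ignored.

```python
import re

# Hypothetical script text annotated with YesWorkflow-style comments.
# It is only parsed as text here, never executed.
SCRIPT = """
# @begin clean_data
# @in raw_file
# @out cleaned
cleaned = clean(raw_file)
# @end clean_data

# @begin make_plot
# @in cleaned
# @out figure
figure = plot(cleaned)
# @end make_plot
"""

# Matches comments such as "# @in raw_file" (simplified; assumption of this sketch).
TAG = re.compile(r"#\s*@(begin|end|in|out)\s+(\S+)")


def extract_blocks(source):
    """Collect flat (non-nested) annotated blocks with their inputs and outputs."""
    blocks = []
    current = None
    for line in source.splitlines():
        match = TAG.search(line)
        if match is None:
            continue
        tag, name = match.groups()
        if tag == "begin":
            current = {"name": name, "in": [], "out": []}
        elif tag == "end":
            if current is not None:
                blocks.append(current)
                current = None
        elif current is not None:
            # tag is "in" or "out": record the named data item on the open block
            current[tag].append(name)
    return blocks


def dataflow_edges(blocks):
    """Link each block's outputs to the blocks that declare them as inputs."""
    producers = {}
    for block in blocks:
        for item in block["out"]:
            producers[item] = block["name"]
    edges = []
    for block in blocks:
        for item in block["in"]:
            if item in producers:
                edges.append((producers[item], item, block["name"]))
    return edges


if __name__ == "__main__":
    for producer, item, consumer in dataflow_edges(extract_blocks(SCRIPT)):
        print(f"{producer} --{item}--> {consumer}")
```

Because the annotations live in comments, the same approach would apply to scripts in any language with line comments, which is the language independence the abstract refers to.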
DL Apr 26, 2013
PAV ontology: Provenance, Authoring and Versioning
Paolo Ciccarese, Stian Soiland-Reyes, Khalid Belhajjame et al.
Provenance is a critical ingredient for establishing trust in published scientific content. This is true whether we are considering a data set, a computational workflow, a peer-reviewed publication or a simple scientific claim with supporting evidence. Existing vocabularies such as DC Terms and the W3C PROV-O are domain-independent and general-purpose, and they allow and encourage extensions to cover more specific needs. We identify the specific need to distinguish between the various roles assumed by agents manipulating digital artifacts, such as author, contributor and curator. We present the Provenance, Authoring and Versioning ontology (PAV): a lightweight ontology for capturing just enough description to track the provenance, authoring and versioning of web resources. We argue that such descriptions are essential for digital scientific content. PAV distinguishes between contributors, authors and curators of content and creators of representations, in addition to the provenance of originating resources that have been accessed, transformed and consumed. We explore five projects (and communities) that have adopted PAV, illustrating their usage through concrete examples. Moreover, we present mappings that show how PAV extends the PROV-O ontology to support broader interoperability. The authors strove to keep PAV lightweight and compact by including only those terms that have been demonstrated to be pragmatically useful in existing applications, and by recommending terms from existing ontologies where plausible. We analyze and compare PAV with related approaches, namely Provenance Vocabulary, DC Terms and BIBFRAME. We identify similarities and analyze their differences from PAV, outlining strengths and weaknesses of our proposed model. We specify SKOS mappings that align PAV with DC Terms.
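The role distinctions described above can be sketched as a handful of RDF statements. The example below assumes the PAV namespace `http://purl.org/pav/` and the properties `pav:authoredBy`, `pav:curatedBy`, `pav:createdBy`, `pav:version` and `pav:previousVersion`; the helper names `pav_triples` and `to_ntriples`, and all `example.org` IRIs, are hypothetical and chosen only for illustration. The serializer is deliberately simplified and is not a full N-Triples writer.

```python
PAV = "http://purl.org/pav/"


def pav_triples(resource, authored_by=(), curated_by=(), created_by=(),
                version=None, previous_version=None):
    """Build (subject, predicate, object) tuples describing one web resource."""
    triples = []
    for agent in authored_by:      # authors of the content
        triples.append((resource, PAV + "authoredBy", agent))
    for agent in curated_by:       # curators of the content
        triples.append((resource, PAV + "curatedBy", agent))
    for agent in created_by:       # creators of this particular representation
        triples.append((resource, PAV + "createdBy", agent))
    if version is not None:
        triples.append((resource, PAV + "version", version))
    if previous_version is not None:
        triples.append((resource, PAV + "previousVersion", previous_version))
    return triples


def to_ntriples(triples):
    """Simplified serializer: objects starting with 'http' are IRIs, others are plain literals."""
    lines = []
    for subject, predicate, obj in triples:
        rendered = f"<{obj}>" if obj.startswith("http") else f'"{obj}"'
        lines.append(f"<{subject}> <{predicate}> {rendered} .")
    return "\n".join(lines)


if __name__ == "__main__":
    # All IRIs below are made-up placeholders.
    description = pav_triples(
        "http://example.org/dataset/v2",
        authored_by=["http://example.org/people/alice"],
        curated_by=["http://example.org/people/bob"],
        created_by=["http://example.org/people/carol"],
        version="2",
        previous_version="http://example.org/dataset/v1",
    )
    print(to_ntriples(description))
```

Keeping authoring, curation and creation as separate properties lets a consumer tell who wrote the content apart from who produced a given representation of it, which a single generic creator field cannot express.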