A survey of provenance in scientific workflow

Abstract

The automation of data analysis in the form of scientific workflows has become a widely adopted practice in many fields of research. Data-intensive experiments that use workflows benefit from automation and provenance support, which help to alleviate the reproducibility crisis. This paper investigates existing provenance models as well as provenance applications in scientific workflows. We summarize the models at different levels and compare the applications, with particular attention to blockchain applied to provenance in scientific workflows. We then propose a new design for a secure provenance system. Finally, we discuss the provenance capabilities that emerging technologies would enable.

Keywords

Provenance; scientific workflows; provenance model; blockchain

1. Introduction

With the development of the new paradigm of data-intensive scientific research, research has increasingly become a data-driven knowledge discovery activity.

As a result, researchers in different fields, faced with large amounts of research data (including raw data, code, and descriptions of scientific workflows), have encountered a serious reproducibility crisis [1]: many scientific experiments are difficult to reproduce.

Undoubtedly, continuous and repetitive work consumes the valuable time of researchers [29] and is not conducive to the development and progress of science. Moreover, the public would lose trust in scientific research if experiments cannot be reproduced and results cannot be verified, which could result in economic losses.

As a common means of managing modern scientific experiments, scientific workflow can provide efficient management for data-intensive scientific research [59], including data provenance, data layout, data management, etc. The application of scientific workflow can help to cope with the reproducibility crisis of research. Scientific workflow systems have played an important role in scientific collaboration in the fields of medicine [51], physics [6], biology [74] and so on.

Figure 1 shows an example of a scientific workflow diagram for the construction of a knowledge graph in the field of forestry. Each task ( $T_{1}$ – $T_{5}$ ) addresses an individual computational step. The workflow starts with data acquisition (task $T_{1}$ ), which takes its input from a web crawler (task $T_{2}$ ) and standard files. A corpus is then extracted with TF-IDF, and the relationships among the generated ontologies form a provenance model (task $T_{3}$ ). Next, knowledge fusion is performed on the retrieved model (task $T_{4}$ ) in two steps: an ontology mapping procedure and knowledge integration. The latter outputs a knowledge graph. Finally, the data is stored in the Neo4j database and visualized (task $T_{5}$ ). The provenance of a scientific workflow captures the derivation steps of a data item across a series of computational tasks. Rectangles and edges represent tasks and dependency relations, respectively. An edge $o_{n} \to i_{m}$ indicates that data item $o_{n}$ is the input of the next linked task, and $P_{l}$ represents the processing of a specific task.

Fig. 1.

Example of scientific workflow diagram for the construction of knowledge graph in forestry field.
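The task-and-edge structure of such a workflow can be illustrated with a minimal Python sketch. The task labels follow Fig. 1, but the exact dependency edges, the dictionary layout, and the `upstream` helper are assumptions made for this illustration and do not correspond to any particular workflow system.

```python
from collections import deque

# Hypothetical encoding of the Fig. 1 workflow: each task maps to the
# tasks whose outputs it consumes (the edges are assumed for illustration).
DEPENDS_ON = {
    "T1_data_acquisition": ["T2_web_crawler"],
    "T2_web_crawler": [],
    "T3_ontology_extraction": ["T1_data_acquisition"],
    "T4_knowledge_fusion": ["T3_ontology_extraction"],
    "T5_store_and_visualize": ["T4_knowledge_fusion"],
}


def upstream(task, depends_on):
    """Return every task that the given task transitively depends on."""
    seen = []
    queue = deque(depends_on[task])
    while queue:
        current = queue.popleft()
        if current not in seen:
            seen.append(current)
            queue.extend(depends_on[current])
    return seen


# Backward trace from the final task to everything it was derived from.
lineage = upstream("T5_store_and_visualize", DEPENDS_ON)
print(lineage)
```

Walking the dependency edges backward in this way is the basic operation behind answering where a result came from.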

Data provenance is a critical aspect of reproducibility [30]. It can improve the replicability and verifiability of scientific experiments by capturing the conversion of information from the source to the final result.

Meanwhile, the captured information serves as important documentation, which is vital for maintaining data, assessing its accuracy and origin, and reproducing and validating findings [32].

Over the past decade, with the rapid development of computer technology, a variety of scientific workflow systems have been developed. These systems have evolved along with the latest computing technologies, from local computer systems [89] to cloud platforms [52] and now to big data clusters [58]. As a result, the concept and implementation of data provenance techniques have changed.

We use the Panda provenance system as an example of how provenance is constructed. Panda [49] is a provenance system for data-oriented workflows that captures both data-based and process-based provenance. It also supports the full range of provenance granularity, from fine to coarse.

We use a detailed fictitious example to illustrate how provenance is constructed. To complete an experiment, Alice, a scientific researcher, crawls some data from the Internet and produces a report after data cleaning and analysis. Later, another researcher, Tom, wants to cite Alice’s data in research on a related topic, so it is important for him to understand the origin of Alice’s data. Tom analyzes the data and finds an error in Alice’s data analysis report. He therefore captures the data provenance and identifies the steps that led to Alice’s error. The workflow is shown in Fig. 2.

The initial input of the workflow is the data crawled on the network and stored in the local database. The workflow includes the following two steps:

Data cleaning: process null values in data, delete duplicate values, and unify date format and naming format

Data analysis: call common data analysis methods for data analysis, and output visual charts. Scientific researcher Alice organizes the results to form a data analysis report

Now, when reading Alice’s data analysis report, researcher Tom finds that a chart does not conform to mathematical rules. Tom wants to find out the cause of this error.

Fig. 2.

Workflow provenance example: researcher Tom can trace Alice’s errors in the Panda system.
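The scenario above can be made concrete with a small sketch of record-level provenance capture. The sketch is not Panda's API: the toy data, the `sources` field, and the `clean` and `analyze` functions are all hypothetical and only show how carrying source identifiers through each step lets a suspicious result be traced back to raw records.

```python
# Raw crawled records (fabricated toy data for illustration only).
raw = [
    {"id": 1, "date": "2021/01/05", "value": 10},
    {"id": 2, "date": "2021-01-05", "value": 10},
    {"id": 3, "date": "2021-01-06", "value": None},
    {"id": 4, "date": "2021-01-06", "value": -7},
]


def clean(records):
    """Drop null values, unify the date format, and merge duplicates.

    Each cleaned row keeps a 'sources' list of contributing raw ids.
    """
    cleaned = {}
    for rec in records:
        if rec["value"] is None:
            continue
        date = rec["date"].replace("/", "-")
        key = (date, rec["value"])
        if key in cleaned:
            cleaned[key]["sources"].append(rec["id"])
        else:
            cleaned[key] = {"date": date, "value": rec["value"],
                            "sources": [rec["id"]]}
    return list(cleaned.values())


def analyze(rows):
    """Sum values per date, carrying the source ids forward."""
    totals = {}
    for row in rows:
        entry = totals.setdefault(row["date"], {"total": 0, "sources": []})
        entry["total"] += row["value"]
        entry["sources"].extend(row["sources"])
    return totals


report = analyze(clean(raw))
# Tom notices a suspicious (negative) total and traces it to raw records.
suspect_sources = report["2021-01-06"]["sources"]
print(suspect_sources)
```

In this toy run the negative total for the second date leads back to a single raw record, which is the kind of backward trace Tom performs in the example.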

The Panda provenance system can trace the above fictitious example. The architecture of the Panda system is shown in Fig. 3. To track the steps of the workflow, the SQLite server in the Panda system provides the following four modules:

Data Tables: used to store all data, including node metadata table, user information table, database metadata table, activity table, etc

SQL Transformation: the SQL transformation engine automatically creates provenance predicates for SQL transformations and creates the provenance predicate tables. However, it is not required for a basic provenance system

Workflow Table: used to store information for a workflow

Provenance Predicate tables: Panda uses the bipartite graph model induced by provenance predicates as its provenance model. Provenance is serialized as triples, for example a customer-item-probability triple.

In this paper, we give a perspective on current data provenance in scientific workflows. We first review current provenance techniques based on the previous literature, focusing on provenance built on blockchain in recent years. Subsequently, we compare the provenance techniques applied in scientific workflows to help scientific data managers and developers design provenance systems. Then, we propose a provenance system architecture based on blockchain and the ProVOC model. Finally, we envisage the application of graph databases and knowledge graph technology to provenance in scientific workflows.

The remainder of the paper is organized as follows. Section 2 describes the provenance model, which is an indispensable part of building a provenance system in a scientific workflow. Section 3 introduces and contrasts common provenance applications in scientific workflows based on both conventional and blockchain technologies. In Section 4, a new secure provenance system architecture is proposed. Challenges and opportunities in provenance research are given in Section 5. Finally, we conclude in Section 6.

Fig. 3.

Architecture of the Panda system, including client, Graphical interface, Panda layer, SQLite server and file system.

2. Provenance model

The provenance model forms the basis for the data provenance system. Generally speaking, the construction of a provenance model is divided into three steps as follows.

First, as a kind of metadata [83], the provenance elements should be predefined [14]. Second, the provenance should be transformed into a computer-processable form based on a standard or self-defined provenance representation model. Finally, if security must be guaranteed, a provenance security model should also be introduced or purpose-built.

Currently, numerous provenance models have been developed in different fields and organizations. Such models are built based on different tasks. Some of them have good universality, while others are used for specific data objects or application scenarios.

This section summarizes the provenance structure in three layers: record, representation, and security. Note that the record is the annotation that should be captured.

2.1. Record of provenance

The record of provenance specifies what should be recorded about the data provenance. It contains the derivation history of the data product and is gathered as annotations along with descriptions of the source data and procedures. This is an eager form of representation in that it is readily usable as metadata [7]. However, excessive metadata records place a greater burden on storage, so a reasonable record design can improve the efficiency of data provenance.

Early studies captured the historical origin of only a small amount of data and could not achieve the purpose of complete data provenance. With the increase of data volume and the deepening of research, Buneman et al. [12] put forward Why and Where provenance.

These answer two questions: which existing data contributed to an output, and where in the source the output data came from. The Why provenance of an output tuple provides a set of witnesses for that output tuple. Nevertheless, it does not provide additional information on how the output tuple is actually derived. Green et al. use semirings of polynomials to represent more comprehensive provenance, so that query results can be annotated with polynomials over a provenance semiring, from which a more detailed derivation process can be inferred. They named this approach How provenance [43].
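The difference between Why and How provenance can be illustrated with a small sketch. The relations, the tuple identifiers `r1`, `r2`, `s1`, `s2`, and the `Counter`-based polynomial encoding are assumptions chosen for brevity; the sketch only mirrors the idea that a join multiplies annotations while alternative derivations add them.

```python
from collections import Counter

# A provenance polynomial is a Counter mapping a monomial (a sorted tuple
# of input-tuple identifiers) to its integer coefficient.


def token(name):
    """Annotation of a single input tuple."""
    return Counter({(name,): 1})


def plus(p, q):
    """Alternative derivations (union or projection): add polynomials."""
    return p + q


def times(p, q):
    """Joint use of inputs (join): multiply polynomials."""
    result = Counter()
    for m1, c1 in p.items():
        for m2, c2 in q.items():
            result[tuple(sorted(m1 + m2))] += c1 * c2
    return result


# Hypothetical annotated relations R(a, b) and S(b, c).
R = {("a1", "b1"): token("r1"), ("a1", "b2"): token("r2")}
S = {("b1", "c1"): token("s1"), ("b2", "c1"): token("s2")}


def join_project(left, right):
    """Compute the projection on (a, c) of the join of left and right."""
    out = {}
    for (a, b), p in left.items():
        for (b2, c), q in right.items():
            if b == b2:
                out[(a, c)] = plus(out.get((a, c), Counter()), times(p, q))
    return out


result = join_project(R, S)
how = result[("a1", "c1")]            # r1*s1 + r2*s2
why = {frozenset(m) for m in how}     # witness sets, structure forgotten
print(how, why)
```

Here the How provenance of the output tuple is the polynomial r1·s1 + r2·s2, while the witness sets {r1, s1} and {r2, s2} are what remains once coefficients and exponents are dropped.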

However, such a taxonomy is not well suited to other fields, such as scientific workflows. Sudha et al. [75] proposed the W7 model and pointed out that the record of provenance should include Who, When, Where, How, Which, What and Why.

The records can be richer and, in addition to the derivation history, often include the parameters passed to the derivation process and the version of the workflow, which will enable reproduction of the data and even related publication references [71].
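As an illustration, a W7-style record extended with parameters and a workflow version could be represented as follows. The class name `W7Record`, the field types, and all example values are hypothetical; the W7 model itself does not prescribe this layout.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class W7Record:
    """One provenance record with the seven W7 elements (assumed layout)."""
    what: str
    when: str
    where: str
    how: str
    who: str
    which: str
    why: str
    # Optional richer information mentioned in the text.
    parameters: dict = field(default_factory=dict)
    workflow_version: str = ""


record = W7Record(
    what="cleaned dataset created",
    when="2021-01-06T10:00:00Z",
    where="local database",
    how="null removal and date normalization",
    who="Alice",
    which="cleaning script",
    why="prepare data for analysis",
    parameters={"date_format": "YYYY-MM-DD"},
    workflow_version="1.0",
)

# Serialize the record so that it can be stored or exchanged as metadata.
serialized = json.dumps(asdict(record), sort_keys=True)
print(serialized)
```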

Farah et al. [90] propose a provenance framework that achieves consistency of provenance across different granularities in a hierarchical manner and supports comprehensive and fully re-executable workflows equipped with domain-specific data. The proposed framework takes the Common Workflow Language (CWL) as the carrier of the workflow, represents provenance based on the PROV representation model, and finally encapsulates the recorded data into research objects for information exchange. They also sorted out the records that should be considered in the construction of a provenance system and summarized 19 recommendations, which are divided into five categories: Data Sharing, Retrospective Provenance, Prospective Provenance, Execution Environment, and Findability & Understandability. The details of the recommendations are shown in Fig. 4.

Fig. 4.

The 19 recommendations summarized by Farah et al. [90], intended to provide a reference for the recording of provenance.

2.2. Representation of provenance

A wide range of representation models is available for provenance data. Representation models are used to represent specific types and aspects of data provenance information, such as attributes, references, and versions, in the form of various vocabularies and ontologies. The first typical general representation model is the OPM [67], which was promulgated by the first International Provenance and Annotation Workshop (IPAW). It extends OPMV by defining more constraints using complex OWL2 constructors and defines a core set of rules that describe inter-transaction relationships. It includes three kinds of entities, Artifact, Process and Agent, which are linked by causal relationships representing their dependencies: used, wasGeneratedBy, wasControlledBy, wasTriggeredBy, and wasDerivedFrom. In general, the relationships between nodes are described by constructing a data provenance graph [66].
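A minimal sketch of an OPM-style provenance graph is given below. The `OPMGraph` class and the example nodes are hypothetical; only the three node kinds and the five causal relationships are taken from OPM, with each edge pointing from an effect to its cause.

```python
# Allowed OPM causal edges, written as relation -> (effect kind, cause kind).
EDGE_RULES = {
    "used": ("Process", "Artifact"),
    "wasGeneratedBy": ("Artifact", "Process"),
    "wasControlledBy": ("Process", "Agent"),
    "wasTriggeredBy": ("Process", "Process"),
    "wasDerivedFrom": ("Artifact", "Artifact"),
}


class OPMGraph:
    """A tiny in-memory provenance graph that enforces OPM edge typing."""

    def __init__(self):
        self.nodes = {}   # node name -> kind
        self.edges = []   # (effect, relation, cause)

    def add_node(self, name, kind):
        if kind not in ("Artifact", "Process", "Agent"):
            raise ValueError(f"unknown node kind: {kind}")
        self.nodes[name] = kind

    def add_edge(self, effect, relation, cause):
        expected = EDGE_RULES[relation]
        actual = (self.nodes[effect], self.nodes[cause])
        if actual != expected:
            raise ValueError(f"{relation} expects {expected}, got {actual}")
        self.edges.append((effect, relation, cause))

    def causes_of(self, name):
        """Return the direct causes recorded for a node."""
        return [(rel, cause) for effect, rel, cause in self.edges
                if effect == name]


graph = OPMGraph()
graph.add_node("raw_data", "Artifact")
graph.add_node("clean_data", "Artifact")
graph.add_node("cleaning", "Process")
graph.add_node("alice", "Agent")
graph.add_edge("cleaning", "used", "raw_data")
graph.add_edge("clean_data", "wasGeneratedBy", "cleaning")
graph.add_edge("cleaning", "wasControlledBy", "alice")
graph.add_edge("clean_data", "wasDerivedFrom", "raw_data")
print(graph.causes_of("clean_data"))
```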

As OPM came into use, many problems were gradually exposed, such as vague concept terms and usage or improper concept design (e.g., Time, Properties, and Relations) [5]. Therefore, the W3C proposed the PROV standard model based on OPM. Its core is the conceptual data model PROV-DM [60], whose structure includes Entity, Activity, Agent and seven types of relation sets. While PROV-DM strengthens the description of semantic relationships among data, it also makes the resulting data model rather complex: every relationship between entities is associated with all the relationship types defined in the model. Thus, in a specific application, the data model needs to be optimized so that the provenance information is saved completely while the number of data tables is reduced.

Although PROV is so refined and detailed that most domain expert applications cover only part of the complete recommendation, there are exceptions. Moreau and Missier extend the W3C PROV Data Model (PROV-DM) [68], which is used by most of the provenance community. A provenance graph represents entities, activities, and agents. ProvStore [48] and PROV-WF [26] provide, respectively, a web service to manipulate provenance documents and a runtime provenance that can be queried even during the workflow execution.

ProVOC [19] is a provenance representation model published in the form of a Chinese national standard. It consists of three basic categories: Data, Activity and Agent. Data includes two subclasses: Parameter and Dataset. Parameter in turn includes three subclasses: Temporal parameter, Spatial parameter and Condition parameter. The structure of the ProVOC model is shown in Fig. 5.

Fig. 5.

Structure of ProVOC model.
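The category hierarchy described for ProVOC can be mirrored directly as a class hierarchy. The following sketch is only an illustration of that structure: the Python class names and the `category` helper are our own and are not defined by the standard.

```python
class Data:
    """Top-level ProVOC category for data objects."""


class Parameter(Data):
    """Data subclass for parameters."""


class TemporalParameter(Parameter):
    """Parameter describing time."""


class SpatialParameter(Parameter):
    """Parameter describing location."""


class ConditionParameter(Parameter):
    """Parameter describing a condition."""


class Dataset(Data):
    """Data subclass for datasets."""


class Activity:
    """Top-level ProVOC category for activities."""


class Agent:
    """Top-level ProVOC category for agents."""


def category(obj):
    """Return the name of the basic category an object belongs to."""
    for base in (Data, Activity, Agent):
        if isinstance(obj, base):
            return base.__name__
    raise TypeError("object does not belong to a ProVOC category")


print(category(TemporalParameter()), category(Dataset()), category(Agent()))
```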

In some special scenarios, standardized representation models cannot meet the needs of the specific application. To address this, many specific models have been developed.

In the field of scientific workflows, Roger S et al. propose Provenir [80] and take the Neptune Project as an example of a scientific workflow to explain the construction of the provenance model in detail. Similarly, Provenir contains three basic categories: Data, Agent and Process. Since Provenir is built on OWL-DL [64], Data and Agent are defined as specializations of the continuant class of OWL-DL, whereas Process is a synonym of occurrent.

The concepts proposed by Provenir are equivalent to the three top-level concepts of the OPM ontology [65]. Paolo et al. further extended the Provenir model to better represent the domain semantics of workflows. Garijo and Gil propose the Open Provenance Model for Workflows (OPMW) [39]. It provides a structure for distributing computational workflows, which includes the specification of an OPMW ontology for describing workflow provenance and workflow layouts. Based on OPMW, an approach for automatically detecting the most common workflow fragments in scientific workflow datasets was subsequently created [38]. ProvONE [27] is another data model for scientific workflow provenance representation. It was built to be compatible with PROV-DM and provides constructs to model workflow specification and workflow execution provenance. Models such as ProvONE, PROV-DM, and OPMW can capture, store, and query the provenance of a workflow, as well as trace it in a typical, machine-readable format. However, they lack the ability to describe control-flow-driven workflows correctly and completely, which leads to incomplete workflow structures and unspecified workflow issues. To address this, ProvONE+ [13] captures the provenance of workflows by extending ProvONE.

2.3. Security of provenance

In most cases, provenance data is sensitive, and a small variation or adjustment changes the entire chain of connected data [53]. Since provenance metadata is itself transmitted and stored as data, it is imperative that provenance data be secured against unauthorized access and leak no information about the data about which it is collected [11].

To date, most studies of provenance security have basically re-applied existing mechanisms, for example, access control, digital signatures, information flow control, or privacy protection, without addressing the underlying semantic questions. A notable exception is Chong’s proposal to define provenance security guidelines [21], drawing on the provenance model presented by Acar, Ahmed, and Cheney [20]. Uri and Avi [9] use two non-interfering models, which protect the workflow and restrict users’ access to nodes, to protect data in the process of provenance. As an improvement on the previous model, they view the provenance as a causality graph with annotations, in which each node represents an object and each edge represents a relationship between two objects. Based on this concept, a provenance model that can interact with other access control models has been developed [10]. Hasan et al. [46] created an analytic data provenance threat model and, based on encryption and an incremental chained signature mechanism, ensured the integrity and confidentiality of file system provenance information. However, because an attacker can still strip away the provenance information of a file, their approach does not tackle the problem of data leakage in malicious environments. Zhang et al. [96] improved the threat model and proposed a checksum-based method to verify the integrity of provenance information in a database. Davidson et al. [31] proposed improvements from the perspective of database privacy and anonymity. Taha et al. [2] organize a trusted framework around a Trusted Platform Module (TPM) to guarantee that the collected provenance data is admissible, complete, and confidential at the operating system level.
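The idea of an incremental chained integrity check can be sketched as follows. This is not the scheme of Hasan et al. [46]: for brevity the sketch substitutes an HMAC with a hypothetical shared secret for public-key signatures and omits encryption. It only shows how chaining each check value to the previous one makes modification or removal of an earlier record detectable.

```python
import copy
import hashlib
import hmac
import json


def sign_chain(records, key):
    """Attach to each record a MAC over its content and the previous MAC."""
    chain = []
    previous = b""
    for record in records:
        payload = json.dumps(record, sort_keys=True).encode() + previous
        mac = hmac.new(key, payload, hashlib.sha256).hexdigest()
        chain.append({"record": record, "mac": mac})
        previous = mac.encode()
    return chain


def verify_chain(chain, key):
    """Recompute every MAC in order; any edit or removal breaks the chain."""
    previous = b""
    for entry in chain:
        payload = json.dumps(entry["record"], sort_keys=True).encode() + previous
        expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
        if not hmac.compare_digest(expected, entry["mac"]):
            return False
        previous = entry["mac"].encode()
    return True


KEY = b"demo-secret"  # hypothetical shared secret, for illustration only
records = [
    {"actor": "alice", "action": "create", "file": "report.csv"},
    {"actor": "alice", "action": "clean", "file": "report.csv"},
    {"actor": "tom", "action": "read", "file": "report.csv"},
]
chain = sign_chain(records, KEY)

tampered = copy.deepcopy(chain)
tampered[1]["record"]["action"] = "delete"   # modify a middle record

truncated = chain[:1] + chain[2:]            # silently drop a middle record

print(verify_chain(chain, KEY),
      verify_chain(tampered, KEY),
      verify_chain(truncated, KEY))
```

Note that such a check detects changes to the chain but, as the text observes, cannot by itself prevent an attacker from discarding the provenance of a file altogether.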

In practice, many factors, such as confidentiality and privacy, need to be considered in the construction of a traditional secure provenance model, as described in [94]. Builders of provenance frameworks therefore have to spend considerable time on the design of security mechanisms. Fortunately, the application of blockchain technology alleviates this dilemma.

A blockchain is a growing list of records, called blocks, that are securely linked together using cryptographic hashes and distributed across servers (nodes). Because each full node stores all the information in the chain, it is extremely difficult to tamper with that information. Compared with traditional networks, the blockchain has two core features: tamper resistance and decentralization. Based on these two characteristics, the information recorded on a blockchain is more authentic and reliable, which can help solve the problem of data security in scientific workflows.
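These two features can be illustrated with a toy hash-linked ledger replicated on three nodes. The block layout and the helper names are assumptions for this sketch, and the simple majority vote over chain heads merely stands in for a real consensus protocol.

```python
import copy
import hashlib
import json
from collections import Counter

GENESIS_PREV = "0" * 64  # placeholder predecessor of the first block


def block_hash(block):
    """Hash of a block's full content, including its link to the past."""
    return hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()


def append_block(chain, data):
    prev = block_hash(chain[-1]) if chain else GENESIS_PREV
    chain.append({"index": len(chain), "data": data, "prev_hash": prev})


def is_linked(chain):
    """Check that every block points at the hash of its predecessor."""
    if chain and chain[0]["prev_hash"] != GENESIS_PREV:
        return False
    return all(chain[i]["prev_hash"] == block_hash(chain[i - 1])
               for i in range(1, len(chain)))


def majority_head(copies):
    """Head hash held by most nodes (a stand-in for real consensus)."""
    heads = Counter(block_hash(chain[-1]) for chain in copies)
    return heads.most_common(1)[0][0]


ledger = []
for entry in ["T1 produced corpus", "T2 produced ontology", "T3 produced graph"]:
    append_block(ledger, entry)

honest_a = copy.deepcopy(ledger)
honest_b = copy.deepcopy(ledger)

naive = copy.deepcopy(ledger)
naive[1]["data"] = "forged"                          # breaks the next link

relinked = copy.deepcopy(ledger)
relinked[1]["data"] = "forged"
relinked[2]["prev_hash"] = block_hash(relinked[1])   # attacker repairs links

accepted = majority_head([honest_a, honest_b, relinked])
print(is_linked(naive), is_linked(relinked),
      block_hash(relinked[-1]) == accepted)
```

A naive edit breaks the hash links, and even a forger who repairs the links on one node ends up with a chain head that disagrees with the copies held by the other nodes.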

Ramachandran et al. [77] developed SmartProvenance, a secure and immutable scientific data provenance management framework, on top of the blockchain. The framework uses smart contracts and OPM to record provenance information. The blockchain serves as a platform that facilitates the collection, verification, and management of trusted data sources, so as to prevent malicious tampering.

Liang et al. [56] proposed ProvChain, an architecture that collects and verifies cloud data provenance by embedding the provenance data into blockchain transactions. It provides the following four capabilities for auditing data operations in cloud storage: Real-time Cloud Data Provenance, Tamper-proof Environment, Enhanced Privacy Preservation, and Provenance Data Validation. Wanghu et al. [17] created ProChain, a blockchain-based system for sharing scientific workflow provenance within a scientific research community, in which data provenance is managed on-chain and delivered off-chain. Using ProChain, scientists can share workflow segments, including their location and related descriptive metadata, in a trustworthy and reliable way.

A scientific data authentication model based on blockchain and distributed ledger, bloxberg, was introduced by Kevin et al. [91], to improve the reusability and integrity of data in scientific workflow.

To support data accountability and provenance tracking for European Union residents’ data, Ricardo et al. [70] use a private blockchain-based network.

3. Provenance techniques

Provenance has been applied in many fields. For scientific data provenance, different solutions and technologies have emerged from varied disciplines over the last decade, such as Karma [84,85], VisTrails [4], Taverna [47], SPADE [40], Kepler [8] and other specialized data provenance application systems. This section mainly explores provenance applications in scientific workflows.

Before the advent of blockchain technology, researchers established data provenance applications or systems by building description models and representation models. For applications that explicitly need to ensure data security, mechanisms such as access control [61], digital signatures, information flow control, privacy protection or even semantic definitions [22] are introduced.

At present, with the continuous development of blockchain technology, the construction of security models becomes more convenient, so some data provenance techniques or applications based on blockchain have also been created.

The following subsections respectively introduce and compare the traditional provenance application and blockchain-based provenance application in scientific workflow.

3.1. Traditional provenance application

In the field of scientific workflows, there are many applications, for instance myExperiment [78], CrowdLabs [62], and KNIME [89], that scientists can use to publish workflow definitions and share them over the web. In addition, Michaelides et al. [63] introduced domain-specific PROV-based provenance to support the portability and reproducibility of programming suites. They captured basic elements from the logs of workflow executions and represented them using an intermediate notation. Yang et al. [93] introduced the design and implementation of DEEP, an executable document environment that generates scientific results dynamically and interactively and also records the provenance of these results. They integrate provenance with the system’s internal data structure by using a specialization of PROV-DM to describe the behavior and resource relationships of the system. Provenance is exposed to DEEP users through an interface that provides varying levels of insight into the structure of the resources and ways to navigate the document.

Another project, PASS [69,81], was described by Seltzer et al. to track provenance automatically. By recording provenance metadata for each object, it helps scientists do their work better at lower cost. The Lineage File System (LPM) [3] is an instance of a PASS. Like PASS, it focuses on executables, command lines, and input files as the source of provenance, but it ignores the hardware and software environment in which such processes run. Subsequently, Camflow [72], a continuation of PASS and LPM [3], offers a cleaner architecture and is easier to maintain. Proposed by Pasquier et al., it is a system-level provenance system that demonstrates retrospective provenance capture at different levels of the system. Other notable domain-specific efforts that use established standards to document provenance and contextual information are PROV-man [5], PoeM [37], and micropublications [23].

Many systems also have built-in retrospective provenance support, such as VisTrails [36], Taverna [47], Confucius [97], WINGS [42] and PLIER [41]. They all build provenance ontologies based on a representation model such as OPM or PROV-DM and present data or workflows in a visual, interactive way. They can even improve the scalability of applications by providing APIs.

All of these efforts use standardized methods for documenting provenance and are therefore related to our work on traditional documenting of retrospective provenance.

3.2. Provenance application based on blockchain

This section gives a brief introduction to blockchain technology and lists some current data provenance applications based on blockchain in scientific workflow systems.

The emergence of blockchain technology provides a convenient interface for establishing a security mechanism for data provenance, which enables the developers and researchers of scientific workflow management systems to achieve secure provenance through the relevant blockchain interfaces [57]. Compared with traditional system construction, builders of blockchain-based systems can concentrate on the provenance logic itself without spending too much effort on security problems, such as whether captured data has been tampered with or the database has been attacked [44].

The blockchain uses P2P communication protocols, consensus algorithms such as PoW, PoS and PBFT, asymmetric encryption, and database technology to ensure data availability. The blockchain architecture [82] is shown in Fig. 6 and is generally divided into five layers: network layer, consensus layer, data layer, smart contract layer and application layer. To achieve tamper resistance, the blockchain uses a chain structure with blocks as units. Although the details of the data structure differ between blockchain platforms, the overall architecture is basically the same. The block structure is shown in Fig. 7. Taking Bitcoin as an example, the elements stored in the block header include the hash of the previous block header, a random number (nonce), the Merkle root, etc. Thanks to this design, developers can write blockchain programs as smart contracts, deploy them to the blockchain, and ensure the automatic, transparent and reliable execution of contracts with the help of the trust mechanism jointly maintained by the whole network.

Fig. 6.

Blockchain architecture, including network layer, consensus layer, data layer, smart contract layer and application layer.

Fig. 7.

The two block structures of the Bitcoin blockchain. Light nodes need to store less information than full nodes, which reduces their storage burden.
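The role of the Merkle root in a block header can be illustrated with a simplified sketch. It assumes single SHA-256 hashing and plain concatenation of child hashes; Bitcoin itself uses double SHA-256 and a specific byte order, which are omitted here, and the provenance strings are made up for the example.

```python
import hashlib


def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(items):
    """Fold a list of byte strings into a single root hash (hex)."""
    if not items:
        raise ValueError("at least one item is required")
    level = [sha256(item) for item in items]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # duplicate the last hash on odd levels
        level = [sha256(level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0].hex()


# Hypothetical provenance statements packed into one block.
records = [
    b"T1 used raw_data",
    b"T1 generated clean_data",
    b"T2 used clean_data",
]
root = merkle_root(records)
tampered_root = merkle_root([b"T1 used forged_data"] + records[1:])
print(root != tampered_root)
```

Because the root summarizes every record in the block, a node that keeps only block headers can still detect that a record presented to it has been altered.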

Given the significant advantages of blockchain in decentralization, trustworthiness, and high reliability, many blockchain-based provenance applications have been proposed in the field of scientific workflows [18,34].

To address the problem of data provenance in scientific data sharing and management platforms, Hao et al. [45] constructed a layered blockchain architecture for scientific data sharing and explored the realization mechanisms of interactive information, data blocks, consensus mechanisms, smart contracts, etc. Based on Hyperledger Fabric, Gu et al. constructed an alliance (consortium) chain model [50] for humanities and social science data sharing, which solved the problems of weak provenance ability and untraceable data usage in traditional data sharing.

To address the problem of the integrity and authenticity of data in scientific workflows, Dinuni et al. [35] developed the blockchain-based SciBlock system to provide data storage for the provenance of scientific workflows. It allows users to query provenance and makes it possible to invalidate unacceptable or obsolete provenance data without deleting it. In addition, Wittek et al. [70] use a private blockchain-based mechanism to support data accountability and provenance tracking for European Union residents’ data. By combining blockchain technology, smart contracts, and metadata-driven data management, Demichev et al. [33] proposed a distributed data provenance management method, ProvHL (Provenance Hyperledger), which supports the storage and exchange of data generated by scientific experiments in a distributed environment. It realizes fault-tolerant and secure provenance management of metadata.
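Invalidating provenance without deleting it can be sketched with an append-only log. The class below is a hypothetical illustration of the general idea and does not reflect how SciBlock is implemented: an invalidation is itself a new entry, so the full history remains available for audit.

```python
class AppendOnlyProvenanceLog:
    """Provenance log in which entries are never removed or overwritten."""

    def __init__(self):
        self._entries = []

    def add(self, record):
        """Append a provenance record and return its entry id."""
        self._entries.append({"type": "record", "record": record})
        return len(self._entries) - 1

    def invalidate(self, entry_id, reason):
        """Mark an earlier record as invalid by appending a new entry."""
        if not 0 <= entry_id < len(self._entries):
            raise ValueError("unknown entry id")
        if self._entries[entry_id]["type"] != "record":
            raise ValueError("only records can be invalidated")
        self._entries.append(
            {"type": "invalidation", "target": entry_id, "reason": reason})

    def valid_records(self):
        """Records that have not been invalidated by a later entry."""
        invalid = {e["target"] for e in self._entries
                   if e["type"] == "invalidation"}
        return [e["record"] for i, e in enumerate(self._entries)
                if e["type"] == "record" and i not in invalid]

    def __len__(self):
        return len(self._entries)


log = AppendOnlyProvenanceLog()
first = log.add({"task": "T1", "output": "corpus_v1"})
second = log.add({"task": "T1", "output": "corpus_v2"})
log.invalidate(first, reason="obsolete input files")
print(log.valid_records(), len(log))
```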

Furthermore, Ramachandran and Kantarcioglu constructed the SmartProvenance system on top of smart contracts, which further strengthens the trustworthiness and integrity of data provenance by implementing randomized voting and encryption mechanisms, respectively. It uses the Open Provenance Model (OPM) to record immutable data provenance [77].

Blockchain is often used together with cloud environments for scientific workflows.

Zawoad et al. [95] analyzed the threats to trusted provenance in a cloud environment under a blockchain application and proposed the secure application provenance (SECAP) scheme, which can effectively ensure the integrity and confidentiality required for provenance. Tosh built Blockcloud [87], a blockchain-based cloud computing provenance framework that improves the PoS (proof of stake) [34] consensus mechanism. ProvChain [56] is a blockchain-based data provenance system that provides tamper-proof records, enables transparent data accountability in the cloud, and helps to enhance the privacy and availability of the provenance data. Unlike other provenance systems, ProvChain does not build a data presentation layer based on a representation model, but instead customizes a set of metadata to record user actions on data files stored in the cloud. Similarly, Wanghu et al. [17] proposed a blockchain-based system called ProChain for sharing provenance data during the execution of scientific workflows. Moreover, Coelho et al. [24] proposed the BlockFlow architecture, which uses the ProvONE [27] model to provide trust support for scientists performing collaborative experiments on the cloud-based Science Ecosystem Platform (E-SECO).

3.3. Comparison

The following compares and analyzes some of the provenance applications mentioned in this paper from four aspects [73] (see Table 1): Provenance model, Data access, Storage, and Security mechanism. Provenance model refers to the data model adopted by the corresponding application; Data access refers to the way in which provenance data can be accessed and explored [28]; Storage indicates where the provenance data is stored (e.g., on-chain or off-chain); and Security mechanism indicates whether a security mechanism is provided.

As shown in Table 1, of the 24 proposals we investigated, 10 were based on blockchain, and the remaining 14 were traditional. All blockchain-based applications provide corresponding security mechanisms to ensure the security of provenance data, while only 6 of the 14 traditional applications do. This is because blockchain technology has a built-in data security mechanism, so developers do not need to design one from scratch. Due to the current capacity limitation on block transaction information (for example, 1MB in Ethereum [92]), the amount of data that transactions on a public chain can hold is far lower than that of a local database. Therefore, blockchain-based applications have to simplify the provenance records or store the provenance data in a combination of on-chain and off-chain storage. For the same reason, it is very difficult to incorporate a provenance model in the blockchain: only 4 of the 10 blockchain-based applications are built on a provenance model, compared with 11 of the 14 traditional applications. Although current blockchain-based applications are tamper-proof, and some also try to incorporate a provenance model to better support data interchange on the web, they do not yet provide intuitive means of data access, which needs to be improved.
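The combination of on-chain and off-chain storage can be sketched as follows. The size limit, the list standing in for the chain, and the dictionary standing in for an off-chain database are all hypothetical; the point is only that a fixed-size digest fits within the limit while the full record does not, and that the digest still lets the off-chain copy be verified.

```python
import hashlib
import json

MAX_ONCHAIN_BYTES = 64  # hypothetical per-transaction payload limit


def put(record, on_chain, off_chain):
    """Store the full record off-chain and only its digest on-chain."""
    payload = json.dumps(record, sort_keys=True).encode()
    digest = hashlib.sha256(payload).hexdigest()
    if len(digest) > MAX_ONCHAIN_BYTES:
        raise ValueError("digest does not fit into a transaction")
    off_chain[digest] = payload
    on_chain.append(digest)
    return digest


def verify(digest, on_chain, off_chain):
    """Check that the off-chain copy still matches the on-chain digest."""
    payload = off_chain.get(digest)
    if payload is None or digest not in on_chain:
        return False
    return hashlib.sha256(payload).hexdigest() == digest


on_chain, off_chain = [], {}
record = {
    "task": "T4",
    "inputs": ["ontology_model"],
    "parameters": {"mapping": "default"},
    "log": "x" * 500,  # stands in for a large execution log
}
digest = put(record, on_chain, off_chain)
print(len(off_chain[digest]), len(digest), verify(digest, on_chain, off_chain))
```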

In summary, the comparison of the 24 data provenance applications leads to the following conclusions. Building a provenance system on blockchain provides a sufficient data security mechanism, but current proposals make little use of provenance models, and their data access methods need further improvement. Future work should study the block capacity limitation further and expand block capacity as far as possible, so that more provenance data can be stored without affecting data consistency and within an acceptable consensus duration.

Table 1
Comparison of some provenance applications sorted out in this paper

No.	Year	Proposal	Provenance model	Data access	Storage	Security mechanism
1	2021	LineageChain [79]	–	APIs	On-chain	YES
2	2021	[91]	–	ERC721Metadata¹ APIs	On-chain	YES
3	2020	BlockFlow [24]	ProvONE	APIs	On-chain	YES
4	2019	SciBlock [35]	–	Graph, web interface, Bloom filter	On-chain, off-chain	YES
5	2019	ProvHL [33]	–	Smart contracts	On-chain	YES
6	2018	Smart Provenance	OPM	Smart contracts	On-chain	YES
7	2018	Prochain [56]	–	Transaction tracing, Bloom filter	On-chain	YES
8	2017	Camflow [72]	PROV-DM extend	NetFilter hooks, LSM hooks	Relational DB, Neo4j, Spark	YES
9	2017	DataProv [76]	OPM	Web interface, smart contracts	On-chain, off-chain	YES
10	2017	ProvChain [56]	–	Tierion API	Provenance Database	YES
11	2017	[86]	PROV-DM	Blockchain transaction	BigChainDB, RethinkDB, SQLite	YES
12	2016	SECAP [95]	–	PMS²	Provenance Database	YES
13	2015	LPM [3]	PROV-DM	PB-DLP³	Gzip, SNAP recorders, Neo4j, PostgreSQL	YES
14	2011	WINGS [42]	OPM	Graph API	OWL, Relational DB	YES
15	2011	PLIER [41]	OPM	Query	Relational DB, XML, RDF, DOT	NO
16	2008	SPADE [40]	OPM	Graph with Neo4j or Graphviz	Neo4j, Relational DB	YES
17	2008	Kepler [8]	OPM/PROV	Graph, Query	File System, XML, RDF, MoML	YES
18	2008	SecProv [15]	–	–	–	YES
19	2007	ZOOM [25]	–	Graph, Zoom UserViews	Relational DB	NO
20	2007	[16]	Provenance Ontology	Graph, ontology-based query	Relational DB, OWL, RDF	NO
21	2006	Taverna [47]	OPM/PROV	Graph	Relational DB, XML, RDF	YES
22	2006	Karma [84,85]	OPM	Graph, Karma Prov Browser	Relational DB, XML	–
23	2005	PASS [69,81]	OPM	Paths, web interface	RDBMS, Berkeley DB	YES
24	2005	VisTrails [36]	OPM	Graph/Query	Relational DB, XML, RDF	NO

¹ https://eips.ethereum.org/EIPS/eip-721

²Provenance Manager System.

³Provenance-Based Data Loss Prevention.

4. A new secure provenance model

Based on the ProVOC provenance model and blockchain, this paper proposes a secure data provenance framework for scientific workflows (Fig. 8). The framework consists of the following parts:

The input data model specifies the data source information monitored by the provenance system, including metadata of the agency, metadata of the activity, and metadata of the dataset.

Data serialization takes RDF as the data model and serializes the data source information into JSON-LD or triples format for data transmission and exchange.

The Neo4j database is used to improve the efficiency of data query, data tracking, and process reproduction. For data provenance, storing data in a graph database has advantages over a relational database [88].

The blockchain, through a smart contract, links the provenance information so that users can verify that the information is reliable and has not been tampered with.
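The serialization step in the list above can be sketched as follows. This is an illustration only: it uses terms from the W3C PROV-O vocabulary (`prov:Agent`, `prov:Activity`, `prov:Entity`, `prov:used`, `prov:wasGeneratedBy`, `prov:wasAssociatedWith`) as a stand-in, since the ProVOC terms themselves are not listed here, and the `ex:` namespace, the identifiers, and the helper names `to_jsonld` and `to_triples` are hypothetical.

```python
import json

PROV = "http://www.w3.org/ns/prov#"
EX = "http://example.org/"  # hypothetical namespace for this example


def to_jsonld(agent_id, activity_id, input_id, output_id):
    """Describe one workflow step as a JSON-LD document."""
    return {
        "@context": {"prov": PROV, "ex": EX},
        "@graph": [
            {"@id": f"ex:{agent_id}", "@type": "prov:Agent"},
            {
                "@id": f"ex:{activity_id}",
                "@type": "prov:Activity",
                "prov:wasAssociatedWith": {"@id": f"ex:{agent_id}"},
                "prov:used": {"@id": f"ex:{input_id}"},
            },
            {"@id": f"ex:{input_id}", "@type": "prov:Entity"},
            {
                "@id": f"ex:{output_id}",
                "@type": "prov:Entity",
                "prov:wasGeneratedBy": {"@id": f"ex:{activity_id}"},
            },
        ],
    }


def to_triples(doc):
    """Flatten the @graph into (subject, predicate, object) tuples."""
    triples = []
    for node in doc["@graph"]:
        for key, value in node.items():
            if key == "@id":
                continue
            predicate = "rdf:type" if key == "@type" else key
            obj = value["@id"] if isinstance(value, dict) else value
            triples.append((node["@id"], predicate, obj))
    return triples


doc = to_jsonld("alice", "extract_entities", "forestry_corpus", "entity_table")
jsonld_text = json.dumps(doc, indent=2, sort_keys=True)  # for transmission
triples = to_triples(doc)  # for loading into a graph store
```

The same in-memory description yields both exchange formats mentioned in the list, which is the property that makes RDF a convenient common data model in this setting.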

When users create workflows in the provenance system, the system automatically captures the provenance information. As a workflow is created and executed, the relevant provenance information is serialized as RDF and synchronized to both Neo4j and the blockchain. When a user needs to analyze the workflow, they can query it through the Cypher interface provided by Neo4j and obtain the complete workflow data. Finally, if the current workflow information needs to be checked for tampering, the user can call the verification interface provided by the blockchain.
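The capture, synchronize, and verify cycle described above can be sketched in a few lines. Everything here is a stand-in chosen for illustration: `graph_store` is a plain dictionary in place of Neo4j, `LedgerStub` is an in-memory class in place of the smart contract, and the `capture`, `fingerprint`, and `is_untampered` names are hypothetical rather than part of the proposed system.

```python
import hashlib
import json


def fingerprint(record: dict) -> str:
    """Hex SHA-256 of a canonical JSON encoding of the record."""
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class LedgerStub:
    """In-memory stand-in for the smart contract: write-once per workflow id."""

    def __init__(self):
        self._anchors = {}

    def register(self, workflow_id, digest):
        if workflow_id in self._anchors:
            raise KeyError(f"{workflow_id} is already anchored")
        self._anchors[workflow_id] = digest

    def verify(self, workflow_id, digest):
        return self._anchors.get(workflow_id) == digest


graph_store = {}  # stand-in for the graph database
ledger = LedgerStub()


def capture(workflow_id, record):
    """Store the full record off-chain and anchor its fingerprint."""
    graph_store[workflow_id] = record
    ledger.register(workflow_id, fingerprint(record))


def is_untampered(workflow_id):
    """Recompute the fingerprint of the stored record and compare."""
    return ledger.verify(workflow_id, fingerprint(graph_store[workflow_id]))
```

A caller would invoke `capture` once when the workflow runs and `is_untampered` whenever the stored record needs to be checked; any later change to the stored record makes the recomputed fingerprint differ from the anchored one.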

The advantage of this framework is that it combines the ProVOC provenance model with blockchain, is practically feasible, and stores data as RDF, which improves interoperability between this system and other provenance systems. The information is stored in the graph database Neo4j, which offers advantages over a traditional relational database in data analysis and visualization.

Fig. 8.

Example of scientific workflow diagram for the construction of knowledge graph in forestry field.

5. Research challenges and opportunities

Contrasting the initial use cases and what can actually be achieved with current provenance systems makes it clear that research is needed in a number of areas.

Querying and inference. Research on semantic-based provenance has made great progress. Although a variety of semantic models have been used for workflow provenance, their inference potential has not yet been fully exploited, and this remains a challenging problem. In recent years, thanks to the rapid development of machine learning and deep learning, knowledge graph techniques based on natural language processing have also begun to be studied and applied in various fields. Knowledge graphs built on graph databases also provide services such as data lineage analysis and data provenance, and can achieve satisfactory query speed compared with provenance systems based on traditional relational databases.
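A lineage query of the kind mentioned above amounts to a graph traversal. The sketch below is illustrative only: the `derived_from` edges are a hypothetical derivation graph loosely modeled on the forestry knowledge graph example, and the breadth-first `lineage` function stands in for the traversal a graph database would run natively.

```python
from collections import deque

# Hypothetical derivation edges: artifact -> artifacts it was derived from.
derived_from = {
    "knowledge_graph": ["entity_table", "relation_table"],
    "entity_table": ["forestry_corpus"],
    "relation_table": ["forestry_corpus", "ontology"],
    "forestry_corpus": [],
    "ontology": [],
}


def lineage(artifact, edges):
    """Return every upstream artifact reachable from `artifact`, nearest first."""
    ordered = []
    visited = {artifact}
    queue = deque([artifact])
    while queue:
        current = queue.popleft()
        for parent in edges.get(current, []):
            if parent not in visited:
                visited.add(parent)
                ordered.append(parent)
                queue.append(parent)
    return ordered


upstream = lineage("knowledge_graph", derived_from)
```

Because each artifact is visited once, the cost grows with the size of the reachable subgraph rather than with the number of join steps, which is one reason graph storage tends to suit this query pattern.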

Fine-grained provenance. Big data technologies, such as Spark and Flink, have been widely used in e-commerce, short-video platforms, and other fields. A direction for future research is how to build finer-grained provenance that collects data and information at a fine granularity, making it possible to predict researcher behavior and system errors, limit error propagation, or self-diagnose changes in output quality.

Security and privacy. Although blockchain technology offers new hope, security is still a major challenge. For important scientific research, it is necessary to build a comprehensively secure scientific workflow system from the hardware layer to the application layer. For example, some studies use differential privacy to protect personal data [55], and others use searchable symmetric encryption (SSE) to prevent disclosure of database information [54]. A complete security system can help ensure that key scientific research results are not stolen by attackers.

6. Conclusion

In this paper, we provide a systematic literature review of published studies on provenance in scientific workflows. First, we introduced scientific workflows and provenance, and illustrated the representation of scientific workflows with a real example, namely the construction of a forestry knowledge graph. Then, we introduced provenance models at three levels: the record model, the representation model, and the security model. We next compared multiple provenance applications for scientific workflows, with particular attention to those using blockchain. Finally, we proposed a provenance system architecture based on blockchain and the ProVOC model, and discussed future research directions, including hotspots such as big data, machine learning, and knowledge graphs.

Acknowledgement

This study is funded by the Guangdong Science and Technology Plan Project (2021B1212100004, 2019B010139001), Guangzhou Science and Technology Plan Project (201902020016) and Guangdong Natural Science Fund Project (2021A1515011243).

Conflict of interest

None to report.

References

1. Baker, Reproducibility crisis, Nature 533(26) (2016), 353–366.

2. M.M. Bany Taha, Chaisiri and R.K.L. Ko, Trusted tamper-evident data provenance, in: 2015 IEEE Trustcom/BigDataSE/ISPA, Vol. 1, 2015, pp. 646–653. doi:10.1109/Trustcom.2015.430.

3. Bates, D.J. Tian, K.R.B. Butler and Moyer, Trustworthy whole-system provenance for the linux kernel, Usenix Security Symposium (2015).

4. Bavoil, S.P. Callahan, Crossno, Freire, Scheidegger, C.T. Silva and H.T. Vo, Vistrails: Enabling interactive multiple-view visualizations, IEEE Visualization (2005).

5. Benabdelkader, A.H.C. van Kampen and S.D. Olabarriaga, Prov-man: A prov-compliant toolkit for provenance management, PeerJ (2015).

6. J.C. Bennett, Bhagatwala, J.H. Chen, Pinar, Salloum and Seshadhri, Trigger detection for adaptive scientific workflows using percentile sampling, SIAM Journal on Scientific Computing 38(5) (2016), 240–263. doi:10.1137/15M1027942.

7. Bhagwat, Chiticariu, W.-C. Tan and Vijayvargiya, An annotation management system for relational databases, Very Large Data Bases, 2004.

8. Bowers, T.M. McPhillips, Riddle, M.K. Anand and Ludäscher, Kepler/ppod: Scientific workflow and provenance support for assembling the tree of life, 2008.

9. Braun and Shinnar, A security model for provenance, 2006.

10. Braun, Shinnar and Seltzer, Securing provenance, Usenix Security Symposium (2008).

11. Braun, Shinnar and M.I. Seltzer, Securing provenance, in: 3rd USENIX Workshop on Hot Topics in Security, Proceedings, San Jose, CA, USA, July 29, 2008, 2008.

12. Buneman, Khanna and Wang-Chiew, Why and where: A characterization of data provenance, in: International Conference on Database Theory, Springer, 2001, pp. 316–330.

13. A.S. Butt and Fitch, Provone+: A provenance model for scientific workflows, in: International Conference on Web Information Systems Engineering, Springer, 2020, pp. 431–444.

14. Carata, Akoush, Balakrishnan, Bytheway, Sohan, Seltzer and Hopper, A primer on provenance, Communications of The ACM 57 (2014), 52–60. doi:10.1145/2596628.

15. Chebotko, Chang, Lu, Fotouhi and Yang, Scientific workflow provenance querying with security views, in: 2008 the Ninth International Conference on Web-Age Information Management, 2008, pp. 349–356. doi:10.1109/WAIM.2008.41.

16. Chebotko, Fei, Lin, Lu and Fotouhi, Storing and querying scientific workflow provenance metadata using an rdbms, in: Third IEEE International Conference on e-Science and Grid Computing (e-Science 2007), IEEE, 2007, pp. 611–618. doi:10.1109/E-SCIENCE.2007.70.

17. Chen, Liang, Li, Qin, Mu and Wang, Blockchain based provenance sharing of scientific workflows, International Conference on Big Data (2018).

18. Chen, Liang, Li, Qin, Mu and Wang, Blockchain based provenance sharing of scientific workflows, in: 2018 IEEE International Conference on Big Data (Big Data), 2018, pp. 3814–3820. doi:10.1109/BigData.2018.8622237.

19. Chen, Hu, Zhu, Gao and Li, Research and popularization of national standard for data provenance descriptive model, Standard Science 4 (2019), 108–112, (in Chinese).

20. Cheney, Acar and Ahmed, Provenance traces, 2008, arXiv preprint arXiv:0812.0564.

21. Chong, Towards semantics for provenance security, in: TAPP’09 First Workshop on Theory and Practice of Provenance, 2009.

22. Chong, Towards semantics for provenance security, in: TAPP’09 First Workshop on Theory and Practice of Provenance, 2009.

23. Clark, Ciccarese and Goble, Micropublications: A semantic model for claims, evidence, arguments and annotations in biomedical communications, Journal of Biomedical Semantics (2013).

24. Coelho, Braga, J.M.N. David, M.A.R. Dantas, Ströele and Campos, Blockchain for reliability in collaborative scientific workflows on cloud platforms, in: International Symposium on Computers and Communications, 2020.

25. Cohen-Boulakia, Biton, Cohen and S.B. Davidson, Addressing the provenance challenge using zoom, Concurrency and Computation: Practice and Experience (2008).

26. Costa, Silva, de Oliveira, K.A.C.S. Ocaña, Ogasawara, Dias and Mattoso, Capturing and querying workflow runtime provenance with prov: A practical approach, EDBT/ICDT Workshops (2013).

27. Cuevas-Vicenttin, Ludäscher and Missier, Provone: A Prov Extension Data Model for scientific workflow provenance, 2016, website https://purl.dataone.org/provone-v1-dev.

28. S.M.S. da Cruz, M.L.M. Campos and Mattoso, Towards a taxonomy of provenance in scientific workflow management systems, in: 2009 Congress on Services-I, IEEE, 2009, pp. 259–266.

29. R.F. da Silva, Filgueira, Pietri, Jiang, Sakellariou and Deelman, A characterization of workflow management systems for extreme-scale applications, Future Generation Computer Systems 75 (2017), 228–238. doi:10.1016/j.future.2017.02.026.

30. S.B. Davidson and Freire, Provenance and scientific workflows: Challenges and opportunities, International Conference on Management of Data (2008).

31. S.B. Davidson, Khanna, Roy, Stoyanovich, Tannen and Chen, On provenance and privacy, International Conference on Database Theory (2011).

32. Deelman, Gannon, Shields and Taylor, Workflows and e-science: An overview of workflow system features and capabilities, Future Generation Computer Systems (2009).

33. Demichev, Dubenskaya, Fedotova, Kryukov, Polyakov and N.V. Prikhod’ko, Provenance Metadata Management in Distributed Storages Using the Hyperledger Blockchain Platform, CEUR Workshop Proceedings, 2019.

34. Demichev, Kryukov and Prikhodko, The approach to managing provenance metadata and data access rights in distributed storage using the hyperledger blockchain platform, in: 2018 Ivannikov Ispras Open Conference (ISPRAS), IEEE, 2018, pp. 131–136. doi:10.1109/ISPRAS.2018.00028.

35. Fernando, Kulshrestha, J.D. Herath, Mahadik, Ma, Bai, Yang, Yan and Lu, Sciblock: A blockchain-based tamper-proof non-repudiable storage for scientific workflow provenance, Color Imaging Conference (2019).

36. Freire and C.T. Silva, Making computations and publications reproducible with vistrails, Computing in Science and Engineering (2012).

37. Gaignard, Skaf-Molli and Bihouée, From scientific workflow patterns to 5-star linked open data, 2016.

38. Garijo, Corcho and Gil, Detecting common scientific workflow fragments using templates and execution provenance, International Conference on Knowledge Capture (2013).

39. Garijo and Gil, A new approach for publishing workflows: Abstractions, standards, and linked data, in: Proceedings of the 6th Workshop on Workflows in Support of Large-Scale Science, 2011, pp. 47–56. doi:10.1145/2110497.2110504.

40. Gehani and Tariq, Spade: Support for provenance auditing in distributed environments, in: ACM/IFIP/USENIX International Conference on Distributed Systems Platforms and Open Distributed Processing, Springer, 2012, pp. 101–120.

41. Gerhards, Skorupa, Sander, Belloum, Vasunin and Benabdelkader, Hist/plier: A two-fold provenance approach for grid-enabled scientific workflows using ws-vlam, in: 2011 IEEE/ACM 12th International Conference on Grid Computing, 2011, pp. 224–225. doi:10.1109/Grid.2011.39.

42. Gil, Ratnakar, Kim, Moody, Deelman, P.A. González-Calero and Groth, Wings: Intelligent workflow-based design of computational experiments, IEEE Intelligent Systems (2011).

43. T.J. Green, Karvounarakis and Tannen, Provenance semirings, Symposium on Principles of Database Systems (2007).

44. Gupta, Tanwar, Kumar and Tyagi, Blockchain-based security attack resilience schemes for autonomous vehicles in industry 4.0: A systematic review, Computers & Electrical Engineering 86 (2020), 106717. doi:10.1016/j.compeleceng.2020.106717.

45. Hao and Deng, Research on scientific data sharing management integrating data supervision and data traceability, Information Studies: Theory & Application 41(3) (2018), 6, (in Chinese).

46. Hasan, Sion and Winslett, The case of the fake picasso: Preventing history forgery with secure provenance.

47. Hull, Wolstencroft, Stevens, Goble, Pocock, Li and Oinn, Taverna: A tool for building and running workflows of services, Nucleic Acids Research (2006).

48. T.D. Huynh and Moreau, Provstore: A public provenance repository, International Provenance and Annotation Workshop (2014).

49. Ikeda and Widom, Panda: A system for provenance and data, Bulletin of the Technical Committee on Data Engineering (2010).

50. Jun and Xin, Design and implementation of a humanities and social sciences data sharing model: A case study of consortium blockchain, Journal of the China Society for Scientific and Technical Information 38(4) (2019), 14, (in Chinese).

51. C.C. Kannas, Kalvari, Lambrinidis, C.M. Neophytou, C.G. Savva, Kirmitzoglou, Antoniou, K.G. Achilleos, Scherf, C.A. Pitta et al., Lisis: An online scientific workflow system for virtual screening, Combinatorial Chemistry & High Throughput Screening 18(3) (2015), 281–295. doi:10.2174/1386207318666150305123341.

52. Kranjc, Podpečan and Lavrač, Clowdflows: A cloud based scientific workflow platform, in: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Springer, 2012, pp. 816–819. doi:10.1007/978-3-642-33486-3_54.

53. K.P. Kumar and R.C. Cherukuri, Securing provenance data with secret sharing mechanism: Model perspective, International Conference Data Science (2019).

54. Li, Huang, Wei, Lv, Liu, Dong and Lou, Searchable symmetric encryption with forward search privacy, IEEE Transactions on Dependable and Secure Computing 18(1) (2019), 460–474. doi:10.1109/TDSC.2019.2894411.

55. Li, Ye, Li, Wang, Lou, Hou, Liu and Lu, Efficient and secure outsourcing of differentially private data publishing with multiple evaluators, IEEE Transactions on Dependable and Secure Computing (2020).

56. Liang, Shetty, D.K. Tosh, C.A. Kamhoua, Kwiat and Njilla, Provchain: A blockchain-based data provenance architecture in cloud environment with enhanced privacy and availability, IEEE ACM International Symposium Cluster Cloud and Grid Computing (2017).

57. Linoy, Ray and Stakhanova, Towards eidetic blockchain systems with enhanced provenance, in: 2020 IEEE 36th International Conference on Data Engineering Workshops (ICDEW), 2020, pp. 7–10. doi:10.1109/ICDEW49219.2020.00-14.

58. Liu, Lu and Che, A survey of modern scientific workflow scheduling algorithms and systems in the era of big data, in: 2020 IEEE International Conference on Services Computing (SCC), IEEE, 2020, pp. 132–141. doi:10.1109/SCC49832.2020.00026.

59. Liu, Pacitti, Valduriez and Mattoso, A survey of data-intensive scientific workflow management, Grid Computing (2015).

60. P.M. Luc Moreau: PROV-DM: The PROV Data Model. version http://www.w3.org/TR/prov-dm/ (2013).

61. Martinho, Domingos and A.R. Silva, Supporting authentication requirements in workflows, in: ICEIS, Vol. 3, 2006, pp. 181–188.

62. Mates, Santos, Freire and C.T. Silva, Crowdlabs: Social analysis and visualization for the sciences, Statistical and Scientific Database Management (2011).

63. Mattoso and Glavic, Provenance and annotation of data and processes, 2016.

64. Melnik, 3. Implementation and Applications, 2004.

65. Missier, S.S. Sahoo, Zhao, Goble and A.P. Sheth, Janus: From workflows to semantic provenance and linked open data, International Provenance and Annotation Workshop (2010).

66. Moreau, Clifford, Freire, Futrelle, Gil, Groth, Kwasnikowska, Miles, Missier, J.D. Myers, Plale, Simmhan, E.G. Stephan and J.V. den Bussche, The open provenance model core specification (v1.1), Future Generation Computer Systems (2011).

67. Moreau, Freire, Futrelle, R.E. McGrath, Myers and Paulson, The open provenance model: An overview, International Provenance and Annotation Workshop (2008).

68. Moreau, Missier, Belhajjame, B’Far, Cheney, Coppens, Cresswell, Gil, Groth, Klyne et al., Prov-Dm: The Prov Data Model, W3C, 2013.

69. K.-K. Muniswamy-Reddy, D.A. Holland, Braun and M.I. Seltzer, Provenance-aware storage systems, in: Usenix Annual Technical Conference, General Track, 2006, pp. 43–56.

70. Neisse, Steri and Nai-Fovino, A blockchain-based approach for data accountability and provenance tracking, Cryptography and Security (2017), arXiv.

71. Pancerella, Hewson, Koegler, Leahy, Lee, Rahn, Yang, J.D. Myers, Didier, McCoy et al., Metadata in the collaboratory for multi-scale chemical science, in: International Conference on Dublin Core and Metadata Applications, 2003, pp. 121–129.

72. Pasquier, Han, Goldstein, Moyer, Eyers, Seltzer and Bacon, Practical whole-system provenance capture, Symposium on Cloud Computing (2017).

73. Pérez, Rubio and Sáenz-Adán, A systematic review of provenance systems, Knowledge and Information Systems 57(3) (2018), 495–543. doi:10.1007/s10115-018-1164-3.

74. Pradal, Artzet, Chopard, Dupuis, Fournier, Mielewczik, Negre, Neveu, Parigot, Valduriez et al., Infraphenogrid: A scientific workflow infrastructure for plant phenomics on the grid, Future Generation Computer Systems 67 (2017), 341–353. doi:10.1016/j.future.2016.06.002.

75. Ram, Liu et al., A new perspective on semantics of data provenance, SWPM 526 (2009).

76. Ramachandran and Kantarcioglu, Using blockchain and smart contracts for secure data provenance management, Cryptography and Security (2017), arXiv.

77. Ramachandran and Kantarcioglu, Smartprovenance: A distributed, blockchain based dataprovenance system, Conference on Data and Application Security and Privacy (2018).

78. D.D. Roure, Goble and Stevens, The design and realisation of the myExperiment virtual research environment for social sharing of workflows, Future Generation Computer Systems (2009).

79. Ruan, T.T.A. Dinh, Lin, Zhang, Chen and B.C. Ooi, Lineagechain: A fine-grained, secure and efficient data provenance system for blockchains, Very Large Data Bases (2021).

80. S.S. Sahoo and A.P. Sheth, Provenir ontology: Towards a framework for escience provenance management, 2009.

81. M.I. Seltzer, K.-K. Muniswamy-Reddy, D.A. Holland, Braun and Ledlie, Provenance-Aware Storage Systems, 2005.

82. Q.-F. Shao, C.-Q. Jin, Zhang, W.-N. Qian and A.-Y. Zhou, Blockchain: Architecture and research progress, Chinese Journal of Computers 41(5) (2018), 20.

83. Simmhan, Plale and Gannon, A survey of data provenance techniques, 2005.

84. Simmhan, Plale and Gannon, Karma2: Provenance management for data-driven workflows, International Journal of Web Services Research (2008).

85. Simmhan, Plale, Gannon and Marru, Performance evaluation of the karma provenance framework for scientific workflows, International Provenance and Annotation Workshop (2006).

86. Stoffers, Trustworthy provenance recording using a blockchain-like database, 2017.

87. D.K. Tosh, Shetty, Liang, Kamhoua and Njilla, Consensus protocols for blockchain-based data provenance: Challenges and opportunities, in: 2017 IEEE 8th Annual Ubiquitous Computing, Electronics and Mobile Communication Conference (UEMCON), IEEE, 2017, pp. 469–474. doi:10.1109/UEMCON.2017.8249088.

88. Vicknair, Macias, Zhao, Nan, Chen and Wilkins, A comparison of a graph database and a relational database: A data provenance perspective, in: Proceedings of the 48th Annual Southeast Regional Conference. ACM SE’10, Association for Computing Machinery, New York, NY, USA, 2010. doi:10.1145/1900008.1900067.

89. W.A. Warr, Scientific workflow systems: Pipeline pilot and knime, Journal of Computer-Aided Molecular Design 26(7) (2012), 801–804. doi:10.1007/s10822-012-9577-7.

90. Wilkinson, Dumontier, I.J. Aalbersberg, Appleton, Axton, Baak, Blomberg, J.-W. Boiten, L.O.B. da Silva Santos, P.E. Bourne, Bouwman, A.J. Brookes, Clark, Crosas, Dillo, O.G. Dumon, S.C. Edmunds, C.T. Evelo, Finkers, Gonzalez-Beltran, A.J.G. Gray, Groth, Goble, J.S. Grethe, Heringa, P.A.C. ’t Hoen, Hooft, Kuhn, Kok, J.N. Kok, S.J. Lusher, M.E. Martone, Mons, A.L. Packer, Persson, Rocca-Serra, Roos, van Schaik, S.-A. Sansone, E.A. Schultes, Sengstag, Slater, Strawn, M.A. Swertz, Thompson, van der Lei, E.M. van Mulligen, Velterop, Waagmeester, Wittenburg, Wolstencroft, Zhao and Mons (eds), The FAIR Guiding Principles for Scientific Data Management and Stewardship, Scientific Data, 2016.

91. Wittek, Lawton, Dohndorf, Weinert and Ionita, A blockchain-based approach to provenance and reproducibility in research workflows, in: 2021 IEEE International Conference on Blockchain and Cryptocurrency (ICBC), 2021.

92. Wood, Ethereum: A secure decentralised generalised transaction ledger, 2013.

93. Yang, D.T. Michaelides, C.M.J. Charlton, W.J. Browne and Moreau, Deep: A provenance-aware executable document system, International Provenance and Annotation Workshop (2012).

94. Zafar, Khan, Suhail, Ahmed, Hameed, H.M. Khan, Jabeen and Anjum, Trustworthy data: A survey, taxonomy and future trends of secure provenance schemes, Journal of Network and Computer Applications 94 (2017), 50–68. doi:10.1016/j.jnca.2017.06.003.

95. Zawoad and Hasan, Secap: Towards securing application provenance in the cloud, in: 2016 IEEE 9th International Conference on Cloud Computing (CLOUD), IEEE, 2016, pp. 900–903. doi:10.1109/CLOUD.2016.0132.

96. Zhang, Chapman and LeFevre, Do you know where your data’s been? – Tamper-evident database provenance, in: Workshop on Secure Data Management, Springer, 2009, pp. 17–32. doi:10.1007/978-3-642-04219-5_2.

97. Zhang, Kuc and Lu, Confucius: A tool supporting collaborative scientific workflow composition, IEEE Transactions on Services Computing (2014).