The origin of data : enabling the determination of provenance in multi-institutional scientific systems through the documentation of processes

The Oxford English Dictionary defines provenance as (i) the fact of coming from some particular source or quarter; origin, derivation. (ii) the history or pedigree of a work of art, manuscript, rare book, etc.; concr., a record of the ultimate derivation and passage of an item through its various ow...

Bibliographic Details
Main Author: Groth, Paul
Published: University of Southampton 2007
Subjects:
004
Online Access: https://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.561452
id ndltd-bl.uk-oai-ethos.bl.uk-561452
record_format oai_dc
spelling ndltd-bl.uk-oai-ethos.bl.uk-561452 2018-09-05T03:27:11Z The origin of data : enabling the determination of provenance in multi-institutional scientific systems through the documentation of processes Groth, Paul 2007 004 University of Southampton https://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.561452 https://eprints.soton.ac.uk/264649/ Electronic Thesis or Dissertation
collection NDLTD
sources NDLTD
topic 004
description The Oxford English Dictionary defines provenance as (i) the fact of coming from some particular source or quarter; origin, derivation. (ii) the history or pedigree of a work of art, manuscript, rare book, etc.; concr., a record of the ultimate derivation and passage of an item through its various owners. In art, knowing the provenance of an artwork lends weight and authority to it while providing a context for curators and the public to understand and appreciate the work’s value. Without such a documented history, the work may be misunderstood, unappreciated, or undervalued. In computer systems, knowing the provenance of digital objects would provide them with greater weight, authority, and context just as it does for works of art. Specifically, if the provenance of digital objects could be determined, then users could understand how documents were produced, how simulation results were generated, and why decisions were made. Provenance is of particular importance in science, where experimental results are reused, reproduced, and verified. However, science is increasingly being done through large-scale collaborations that span multiple institutions, which makes the problem of determining the provenance of scientific results significantly harder. Current approaches to this problem are not designed specifically for multi-institutional scientific systems and their evolution towards greater dynamic and peer-to-peer topologies. Therefore, this thesis advocates a new approach, namely, that through the autonomous creation, scalable recording, and principled organisation of documentation of systems’ processes, the determination of the provenance of results produced by complex multi-institutional scientific systems is enabled. The dissertation makes four contributions to the state of the art. First is the idea that provenance is a query performed over documentation of a system’s past process. Thus, the problem is one of how to collect and collate documentation from multiple distributed sources and organise it in a manner that enables the provenance of a digital object to be determined. Second is an open, generic, shared, principled data model for documentation of processes, which enables its collation so that it provides high-quality evidence that a system’s processes occurred. Once documentation has been created, it is recorded into specialised repositories called provenance stores using a formally specified protocol, which ensures documentation has high-quality characteristics. Furthermore, patterns and techniques are given to permit the distributed deployment of provenance stores. The protocol and patterns are the third contribution. The fourth contribution is a characterisation of the use of documentation of process to answer questions related to the provenance of digital objects and the impact recording has on application performance. Specifically, in the context of a bioinformatics case study, it is shown that six different provenance use cases are answered given an overhead of 13% on experiment runtime. Beyond the case study, the solution has been applied to other applications including fault tolerance in service-oriented systems, aerospace engineering, and organ transplant management.
author Groth, Paul
title The origin of data : enabling the determination of provenance in multi-institutional scientific systems through the documentation of processes
title_sort origin of data : enabling the determination of provenance in multi-institutional scientific systems through the documentation of processes
publisher University of Southampton
publishDate 2007
url https://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.561452