Leveraging High Performance Computing for Managing Large and Evolving Data Collections

Bibliographic Details
Main Authors: Ritu Arora, Maria Esteva, Jessica Trelogan
Format: Article
Language: English
Published: University of Edinburgh 2014-10-01
Series: International Journal of Digital Curation
Online Access: http://www.ijdc.net/index.php/ijdc/article/view/331
id doaj-022fe98bd1624d82b47d0f05fe18f588
record_format Article
spelling doaj-022fe98bd1624d82b47d0f05fe18f588 2020-11-25T00:13:18Z eng University of Edinburgh International Journal of Digital Curation 1746-8256 2014-10-01 92172710.2218/ijdc.v9i2.331290 Leveraging High Performance Computing for Managing Large and Evolving Data Collections Ritu Arora Maria Esteva Jessica Trelogan The process of developing a digital collection in the context of a research project often involves a pipeline pattern during which data growth, data types, and data authenticity need to be assessed iteratively in relation to the different research steps and in the interest of archiving. Throughout a project’s lifecycle curators organize newly generated data while cleaning and integrating legacy data when it exists, and deciding what data will be preserved for the long term. Although these actions should be part of a well-oiled data management workflow, there are practical challenges in doing so if the collection is very large and heterogeneous, or is accessed by several researchers contemporaneously. There is a need for data management solutions that can help curators with efficient and on-demand analyses of their collection so that they remain well-informed about its evolving characteristics. In this paper, we describe our efforts towards developing a workflow to leverage open science High Performance Computing (HPC) resources for routinely and efficiently conducting data management tasks on large collections. We demonstrate that HPC resources and techniques can significantly reduce the time for accomplishing critical data management tasks, and enable a dynamic archiving throughout the research process. We use a large archaeological data collection with a long and complex formation history as our test case. We share our experiences in adopting open science HPC resources for large-scale data management, which entails understanding usage of the open source HPC environment and training users. These experiences can be generalized to meet the needs of other data curators working with large collections. http://www.ijdc.net/index.php/ijdc/article/view/331
collection DOAJ
language English
format Article
sources DOAJ
author Ritu Arora
Maria Esteva
Jessica Trelogan
spellingShingle Ritu Arora
Maria Esteva
Jessica Trelogan
Leveraging High Performance Computing for Managing Large and Evolving Data Collections
International Journal of Digital Curation
author_facet Ritu Arora
Maria Esteva
Jessica Trelogan
author_sort Ritu Arora
title Leveraging High Performance Computing for Managing Large and Evolving Data Collections
title_short Leveraging High Performance Computing for Managing Large and Evolving Data Collections
title_full Leveraging High Performance Computing for Managing Large and Evolving Data Collections
title_fullStr Leveraging High Performance Computing for Managing Large and Evolving Data Collections
title_full_unstemmed Leveraging High Performance Computing for Managing Large and Evolving Data Collections
title_sort leveraging high performance computing for managing large and evolving data collections
publisher University of Edinburgh
series International Journal of Digital Curation
issn 1746-8256
publishDate 2014-10-01
description The process of developing a digital collection in the context of a research project often involves a pipeline pattern during which data growth, data types, and data authenticity need to be assessed iteratively in relation to the different research steps and in the interest of archiving. Throughout a project’s lifecycle curators organize newly generated data while cleaning and integrating legacy data when it exists, and deciding what data will be preserved for the long term. Although these actions should be part of a well-oiled data management workflow, there are practical challenges in doing so if the collection is very large and heterogeneous, or is accessed by several researchers contemporaneously. There is a need for data management solutions that can help curators with efficient and on-demand analyses of their collection so that they remain well-informed about its evolving characteristics. In this paper, we describe our efforts towards developing a workflow to leverage open science High Performance Computing (HPC) resources for routinely and efficiently conducting data management tasks on large collections. We demonstrate that HPC resources and techniques can significantly reduce the time for accomplishing critical data management tasks, and enable a dynamic archiving throughout the research process. We use a large archaeological data collection with a long and complex formation history as our test case. We share our experiences in adopting open science HPC resources for large-scale data management, which entails understanding usage of the open source HPC environment and training users. These experiences can be generalized to meet the needs of other data curators working with large collections.
url http://www.ijdc.net/index.php/ijdc/article/view/331
work_keys_str_mv AT rituarora leveraginghighperformancecomputingformanaginglargeandevolvingdatacollections
AT mariaesteva leveraginghighperformancecomputingformanaginglargeandevolvingdatacollections
AT jessicatrelogan leveraginghighperformancecomputingformanaginglargeandevolvingdatacollections
_version_ 1725395146351050752