Leveraging High Performance Computing for Managing Large and Evolving Data Collections

Bibliographic Details
Main Authors:	Ritu Arora, Maria Esteva, Jessica Trelogan
Format:	Article
Language:	English
Published:	University of Edinburgh 2014-10-01
Series:	International Journal of Digital Curation
Online Access:	http://www.ijdc.net/index.php/ijdc/article/view/331

id	doaj-022fe98bd1624d82b47d0f05fe18f588
record_format	Article
spelling	doaj-022fe98bd1624d82b47d0f05fe18f588 2020-11-25T00:13:18Z eng University of Edinburgh International Journal of Digital Curation 1746-8256 2014-10-01 92172710.2218/ijdc.v9i2.331290 Leveraging High Performance Computing for Managing Large and Evolving Data Collections Ritu Arora Maria Esteva Jessica Trelogan http://www.ijdc.net/index.php/ijdc/article/view/331
collection	DOAJ
language	English
format	Article
sources	DOAJ
author	Ritu Arora Maria Esteva Jessica Trelogan
spellingShingle	Ritu Arora Maria Esteva Jessica Trelogan Leveraging High Performance Computing for Managing Large and Evolving Data Collections International Journal of Digital Curation
author_facet	Ritu Arora Maria Esteva Jessica Trelogan
author_sort	Ritu Arora
title	Leveraging High Performance Computing for Managing Large and Evolving Data Collections
title_short	Leveraging High Performance Computing for Managing Large and Evolving Data Collections
title_full	Leveraging High Performance Computing for Managing Large and Evolving Data Collections
title_fullStr	Leveraging High Performance Computing for Managing Large and Evolving Data Collections
title_full_unstemmed	Leveraging High Performance Computing for Managing Large and Evolving Data Collections
title_sort	leveraging high performance computing for managing large and evolving data collections
publisher	University of Edinburgh
series	International Journal of Digital Curation
issn	1746-8256
publishDate	2014-10-01
description	The process of developing a digital collection in the context of a research project often involves a pipeline pattern during which data growth, data types, and data authenticity need to be assessed iteratively in relation to the different research steps and in the interest of archiving. Throughout a project’s lifecycle, curators organize newly generated data while cleaning and integrating legacy data when it exists, and decide what data will be preserved for the long term. Although these actions should be part of a well-oiled data management workflow, there are practical challenges in doing so if the collection is very large and heterogeneous, or is accessed by several researchers contemporaneously. There is a need for data management solutions that can help curators with efficient and on-demand analyses of their collection so that they remain well-informed about its evolving characteristics. In this paper, we describe our efforts towards developing a workflow to leverage open science High Performance Computing (HPC) resources for routinely and efficiently conducting data management tasks on large collections. We demonstrate that HPC resources and techniques can significantly reduce the time for accomplishing critical data management tasks, and enable dynamic archiving throughout the research process. We use a large archaeological data collection with a long and complex formation history as our test case. We share our experiences in adopting open science HPC resources for large-scale data management, which entails understanding usage of the open source HPC environment and training users. These experiences can be generalized to meet the needs of other data curators working with large collections.
url	http://www.ijdc.net/index.php/ijdc/article/view/331
work_keys_str_mv	AT rituarora leveraginghighperformancecomputingformanaginglargeandevolvingdatacollections AT mariaesteva leveraginghighperformancecomputingformanaginglargeandevolvingdatacollections AT jessicatrelogan leveraginghighperformancecomputingformanaginglargeandevolvingdatacollections
_version_	1725395146351050752

Similar Items