Leveraging High Performance Computing for Managing Large and Evolving Data Collections
Main Authors: | Ritu Arora, Maria Esteva, Jessica Trelogan |
---|---|
Format: | Article |
Language: | English |
Published: | University of Edinburgh, 2014-10-01 |
Series: | International Journal of Digital Curation |
Online Access: | http://www.ijdc.net/index.php/ijdc/article/view/331 |
id |
doaj-022fe98bd1624d82b47d0f05fe18f588 |
---|---|
record_format |
Article |
spelling |
doaj-022fe98bd1624d82b47d0f05fe18f5882020-11-25T00:13:18ZengUniversity of EdinburghInternational Journal of Digital Curation1746-82562014-10-0192172710.2218/ijdc.v9i2.331290Leveraging High Performance Computing for Managing Large and Evolving Data CollectionsRitu AroraMaria EstevaJessica Trelogan The process of developing a digital collection in the context of a research project often involves a pipeline pattern during which data growth, data types, and data authenticity need to be assessed iteratively in relation to the different research steps and in the interest of archiving. Throughout a project’s lifecycle, curators organize newly generated data, clean and integrate legacy data when it exists, and decide what data will be preserved for the long term. Although these actions should be part of a well-oiled data management workflow, there are practical challenges in doing so if the collection is very large and heterogeneous, or is accessed by several researchers contemporaneously. There is a need for data management solutions that can help curators with efficient and on-demand analyses of their collection so that they remain well-informed about its evolving characteristics. In this paper, we describe our efforts towards developing a workflow to leverage open science High Performance Computing (HPC) resources for routinely and efficiently conducting data management tasks on large collections. We demonstrate that HPC resources and techniques can significantly reduce the time for accomplishing critical data management tasks, and enable dynamic archiving throughout the research process. We use a large archaeological data collection with a long and complex formation history as our test case. We share our experiences in adopting open science HPC resources for large-scale data management, which entails understanding usage of the open source HPC environment and training users. These experiences can be generalized to meet the needs of other data curators working with large collections. http://www.ijdc.net/index.php/ijdc/article/view/331 |
collection |
DOAJ |
language |
English |
format |
Article |
sources |
DOAJ |
author |
Ritu Arora Maria Esteva Jessica Trelogan |
spellingShingle |
Ritu Arora Maria Esteva Jessica Trelogan Leveraging High Performance Computing for Managing Large and Evolving Data Collections International Journal of Digital Curation |
author_facet |
Ritu Arora Maria Esteva Jessica Trelogan |
author_sort |
Ritu Arora |
title |
Leveraging High Performance Computing for Managing Large and Evolving Data Collections |
title_short |
Leveraging High Performance Computing for Managing Large and Evolving Data Collections |
title_full |
Leveraging High Performance Computing for Managing Large and Evolving Data Collections |
title_fullStr |
Leveraging High Performance Computing for Managing Large and Evolving Data Collections |
title_full_unstemmed |
Leveraging High Performance Computing for Managing Large and Evolving Data Collections |
title_sort |
leveraging high performance computing for managing large and evolving data collections |
publisher |
University of Edinburgh |
series |
International Journal of Digital Curation |
issn |
1746-8256 |
publishDate |
2014-10-01 |
description |
The process of developing a digital collection in the context of a research project often involves a pipeline pattern during which data growth, data types, and data authenticity need to be assessed iteratively in relation to the different research steps and in the interest of archiving. Throughout a project’s lifecycle, curators organize newly generated data, clean and integrate legacy data when it exists, and decide what data will be preserved for the long term. Although these actions should be part of a well-oiled data management workflow, there are practical challenges in doing so if the collection is very large and heterogeneous, or is accessed by several researchers contemporaneously. There is a need for data management solutions that can help curators with efficient and on-demand analyses of their collection so that they remain well-informed about its evolving characteristics. In this paper, we describe our efforts towards developing a workflow to leverage open science High Performance Computing (HPC) resources for routinely and efficiently conducting data management tasks on large collections. We demonstrate that HPC resources and techniques can significantly reduce the time for accomplishing critical data management tasks, and enable dynamic archiving throughout the research process. We use a large archaeological data collection with a long and complex formation history as our test case. We share our experiences in adopting open science HPC resources for large-scale data management, which entails understanding usage of the open source HPC environment and training users. These experiences can be generalized to meet the needs of other data curators working with large collections. |
url |
http://www.ijdc.net/index.php/ijdc/article/view/331 |
work_keys_str_mv |
AT rituarora leveraginghighperformancecomputingformanaginglargeandevolvingdatacollections AT mariaesteva leveraginghighperformancecomputingformanaginglargeandevolvingdatacollections AT jessicatrelogan leveraginghighperformancecomputingformanaginglargeandevolvingdatacollections |
_version_ |
1725395146351050752 |