On-the-Fly Tracing for Data-Centric Computing: Parallelization, Workflow and Applications

As data-centric computing becomes the trend in science and engineering, more and more hardware systems, as well as middleware frameworks, are emerging to handle the intensive computations associated with big data. At the programming level, it is crucial to have corresponding programming paradigms fo...

Bibliographic Details
Main Author: Jiang, Lei
Other Authors: Tom, Michael
Format: Others
Language: en
Published: LSU 2013
Subjects: Computer Science
Online Access: http://etd.lsu.edu/docs/available/etd-04192013-130821/
id ndltd-LSU-oai-etd.lsu.edu-etd-04192013-130821
record_format oai_dc
spelling ndltd-LSU-oai-etd.lsu.edu-etd-04192013-130821 2013-05-02T15:25:35Z On-the-Fly Tracing for Data-Centric Computing: Parallelization, Workflow and Applications Jiang, Lei Computer Science Tom, Michael Chen, Qin J. Zhang, Jian Allen, Gabrielle LSU 2013-05-01 text application/pdf http://etd.lsu.edu/docs/available/etd-04192013-130821/ en unrestricted I hereby certify that, if appropriate, I have obtained and attached herein a written permission statement from the owner(s) of each third party copyrighted matter to be included in my thesis, dissertation, or project report, allowing distribution as specified below. I certify that the version I submitted is the same as that approved by my advisory committee. I hereby grant to LSU or its agents the non-exclusive license to archive and make accessible, under the conditions specified below and in appropriate University policies, my thesis, dissertation, or project report in whole or in part in all forms of media, now or hereafter known. I retain all other ownership rights to the copyright of the thesis, dissertation or project report. I also retain the right to use in future works (such as articles or books) all or part of this thesis, dissertation, or project report.
collection NDLTD
language en
format Others
sources NDLTD
topic Computer Science
description As data-centric computing becomes the trend in science and engineering, more and more hardware systems, as well as middleware frameworks, are emerging to handle the intensive computations associated with big data. At the programming level, it is crucial to have corresponding programming paradigms for dealing with big data. Although MapReduce is now a known programming model for data-centric computing, in which parallelization is completely replaced by partitioning the computing task through data, not all programs, particularly those using statistical computing and data mining algorithms with interdependence, can be re-factorized in such a fashion. On the other hand, many traditional automatic parallelization methods put an emphasis on formalism and may not achieve optimal performance with the given limited computing resources.
In this work we propose a cross-platform programming paradigm, called "on-the-fly data tracing", to provide source-to-source transformation, where the same framework also provides workflow optimization for larger applications. Using a "big-data approximation", computations related to large-scale data input are identified in the code and workflow, and a simplified core dependence graph is built based on the computational load, taking big data into account. The code can then be partitioned into sections for efficient parallelization; at the workflow level, optimization can be performed by adjusting the scheduling for big-data considerations, including the I/O performance of the machine. Regarding each unit in both source code and workflow as a model, this framework enables model-based parallel programming that matches the available computing resources.
The dissertation presents the techniques used in model-based parallel programming, the design of the software framework for both parallelization and workflow optimization, and its implementations in multiple programming languages. The following experiments are performed to validate the framework: i) benchmarking of parallelization speed-up using typical examples in data analysis and machine learning (e.g. naive Bayes, k-means), and ii) three real-world applications in data-centric computing, described to illustrate the efficiency: pattern detection from hurricane and storm surge simulations, road traffic flow prediction, and text mining from social media data. The applications illustrate how to build scalable workflows with the framework, along with performance enhancements.
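The idea of a simplified dependence graph that is then partitioned into parallel sections can be illustrated with a minimal Python sketch. This is not the dissertation's implementation, which the record does not describe in detail. The unit names, the read/write sets, the load estimates, the `min_load` threshold, and both function names are hypothetical, and only read-after-write dependences are tracked.

```python
from collections import defaultdict

# Hypothetical code units in program order:
# name -> (variables read, variables written, estimated load).
UNITS = {
    "load":      (set(),                {"raw"},     100),
    "clean":     ({"raw"},              {"table"},    80),
    "stats":     ({"table"},            {"summary"},  60),
    "clusters":  ({"table"},            {"labels"},   90),
    "log_setup": (set(),                {"logger"},    1),
    "report":    ({"summary", "labels"}, {"out"},      5),
}


def build_dependence_graph(units, min_load=10):
    """Map each heavy unit to the heavy units it reads from.

    Assumption: units whose load is below min_load are dropped, which
    stands in for ignoring work that does not touch large data.
    """
    heavy = [name for name, (_, _, load) in units.items() if load >= min_load]
    last_writer = {}  # variable -> most recent heavy unit that wrote it
    deps = {}
    for name in heavy:
        reads, writes, _ = units[name]
        deps[name] = {last_writer[v] for v in reads if v in last_writer}
        for v in writes:
            last_writer[v] = name
    return deps


def parallel_stages(deps):
    """Group units into stages; units in one stage do not depend on each other."""
    level = {}
    # deps preserves program order, so every dependency is leveled first.
    for name in deps:
        level[name] = 1 + max((level[d] for d in deps[name]), default=-1)
    stages = defaultdict(list)
    for name, lv in level.items():
        stages[lv].append(name)
    return [stages[i] for i in sorted(stages)]


if __name__ == "__main__":
    for i, stage in enumerate(parallel_stages(build_dependence_graph(UNITS))):
        print(i, stage)
```

In this toy input, `stats` and `clusters` both depend only on `clean`, so they land in the same stage and could be run concurrently, while the two light units are left out of the graph.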
author2 Tom, Michael
author_facet Tom, Michael
Jiang, Lei
author Jiang, Lei
author_sort Jiang, Lei
title On-the-Fly Tracing for Data-Centric Computing: Parallelization, Workflow and Applications
publisher LSU
publishDate 2013
url http://etd.lsu.edu/docs/available/etd-04192013-130821/