Building High Performance Data Analytics Systems based on Scale-out Models

Bibliographic Details
Main Author: Huai, Yin
Language: English
Published: The Ohio State University / OhioLINK 2015
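Subjects: Computer Science; Big Data; Systems; Table Placement; Query Optimization; Out-of-band Communications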
Online Access: http://rave.ohiolink.edu/etdc/view?acc_num=osu1427553721
id ndltd-OhioLink-oai-etd.ohiolink.edu-osu1427553721
record_format oai_dc
spelling ndltd-OhioLink-oai-etd.ohiolink.edu-osu1427553721 2021-08-03T06:29:33Z Building High Performance Data Analytics Systems based on Scale-out Models Huai, Yin Computer Science Big Data Systems Table Placement Query Optimization Out-of-band Communications
To respond to the data explosion, new system infrastructures have been built based on scale-out models for the purposes of high data availability and reliable large-scale computations. With the increasing adoption of data analytics systems, users continuously demand high throughput and high performance for various applications. In this dissertation, we identify three critical issues in achieving high throughput and high performance for data analytics: efficient table placement methods (i.e., methods for placing structured data), generating high-quality distributed query plans without unnecessary data movements, and effective support for out-of-band communications. To address these three issues, we conducted a comprehensive study of the design choices of different table placement methods, designed and implemented two optimizations to remove unnecessary data movements in distributed query plans, and introduced a system facility called SideWalk to facilitate the implementation of out-of-band communications. In our first work, on table placement methods, we comprehensively studied existing table placement methods and generalized their basic structure. Based on this basic structure, we conducted a comprehensive evaluation of how different design choices of table placement methods affect I/O performance. Based on our evaluation and analysis, we provided a set of guidelines for users and developers to tune their implementations of table placement methods. In our second work, we built our optimizations on Apache Hive, a widely used open source data warehousing system in the Hadoop ecosystem. We analyze operators that may require data movements in the context of the entire query plan. Our optimization methods remove unnecessary data movements from distributed query plans. Our evaluation shows that these optimization methods can significantly reduce query execution time. In our third work, we designed and implemented SideWalk, a system facility for implementing out-of-band communications. We designed the APIs of SideWalk based on our abstraction of out-of-band communications. With SideWalk, users can implement out-of-band communications in various applications instead of using ad hoc approaches. Through our evaluation, we show that SideWalk can effectively support out-of-band communications, which will be used in implementing advanced data processing flows, and that users can conduct out-of-band communications in a reusable way. Without SideWalk, users commonly need to build out-of-band communications in an ad hoc way, which is hard to reuse and limits programming productivity. The studies proposed in this dissertation have been comprehensively tested and evaluated to show their effectiveness. The guidelines from our table placement method study have been verified by newly implemented and widely used file formats, Optimized Record Columnar File (ORCFile) and Parquet. The optimization methods from our query planner work have been adopted by Apache Hive, which is a widely used data warehousing system in the Hadoop ecosystem and is shipped by all major Hadoop vendors.
2015-05-21 English text The Ohio State University / OhioLINK http://rave.ohiolink.edu/etdc/view?acc_num=osu1427553721 unrestricted This thesis or dissertation is protected by copyright: all rights reserved. It may not be copied or redistributed beyond the terms of applicable copyright laws.
collection NDLTD
language English
sources NDLTD
topic Computer Science
Big Data
Systems
Table Placement
Query Optimization
Out-of-band Communications
author Huai, Yin
title Building High Performance Data Analytics Systems based on Scale-out Models
publisher The Ohio State University / OhioLINK
publishDate 2015
url http://rave.ohiolink.edu/etdc/view?acc_num=osu1427553721