Building High Performance Data Analytics Systems based on Scale-out Models
Main Author: | Huai, Yin |
---|---|
Language: | English |
Published: | The Ohio State University / OhioLINK, 2015 |
Subjects: | |
Online Access: | http://rave.ohiolink.edu/etdc/view?acc_num=osu1427553721 |
id |
ndltd-OhioLink-oai-etd.ohiolink.edu-osu1427553721 |
---|---|
record_format |
oai_dc |
spelling |
ndltd-OhioLink-oai-etd.ohiolink.edu-osu1427553721 2021-08-03T06:29:33Z Building High Performance Data Analytics Systems based on Scale-out Models Huai, Yin Computer Science Big Data Systems Table Placement Query Optimization Out-of-band Communications To respond to the data explosion, new system infrastructures have been built based on scale-out models for the purposes of high data availability and reliable large-scale computations. With the growing adoption of data analytics systems, users continuously demand high throughput and high performance across various applications. In this dissertation, we identify three critical issues in achieving high throughput and high performance for data analytics: efficient table placement methods (i.e., methods for placing structured data), generation of high-quality distributed query plans without unnecessary data movements, and effective support for out-of-band communications. To address these three issues, we have conducted a comprehensive study of the design choices of different table placement methods, designed and implemented two optimizations to remove unnecessary data movements in distributed query plans, and introduced a system facility called SideWalk to facilitate the implementation of out-of-band communications. In our first work, on table placement methods, we comprehensively studied existing table placement methods and generalized their basic structure. Based on this basic structure, we conducted a comprehensive evaluation of the effect of different design choices on I/O performance. Based on our evaluation and analysis, we provided a set of guidelines for users and developers to tune their implementations of table placement methods. In our second work, we built our optimizations on Apache Hive, a widely used open source data warehousing system in the Hadoop ecosystem. We analyzed operators that may require data movements in the context of the entire query plan. Our optimization methods remove unnecessary data movements from distributed query plans. Our evaluation shows that these optimization methods can significantly reduce query execution time. In our third work, we designed and implemented SideWalk, a system facility for implementing out-of-band communications. We designed the APIs of SideWalk based on our abstraction of out-of-band communications. With SideWalk, users can implement out-of-band communications in various applications instead of using ad hoc approaches. Through our evaluation, we show that SideWalk can effectively support out-of-band communications, which will be used in implementing advanced data processing flows, and that users can conduct out-of-band communications in a reusable way. Without SideWalk, users commonly need to build out-of-band communications in an ad hoc way, which is hard to reuse and limits programming productivity. The studies proposed in this dissertation have been comprehensively tested and evaluated to show their effectiveness. The guidelines from our table placement study have been verified by newly implemented and widely used file formats, Optimized Record Columnar File (ORCFile) and Parquet. The optimization methods from our query planner work have been adopted by Apache Hive, which is a widely used data warehousing system in the Hadoop ecosystem and is shipped by all major Hadoop vendors. 2015-05-21 English text The Ohio State University / OhioLINK http://rave.ohiolink.edu/etdc/view?acc_num=osu1427553721 unrestricted This thesis or dissertation is protected by copyright: all rights reserved. It may not be copied or redistributed beyond the terms of applicable copyright laws. |
collection |
NDLTD |
language |
English |
sources |
NDLTD |
topic |
Computer Science Big Data Systems Table Placement Query Optimization Out-of-band Communications |
author |
Huai, Yin |
title |
Building High Performance Data Analytics Systems based on Scale-out Models |
title_sort |
building high performance data analytics systems based on scale-out models |
publisher |
The Ohio State University / OhioLINK |
publishDate |
2015 |
url |
http://rave.ohiolink.edu/etdc/view?acc_num=osu1427553721 |