FedRDD: Federated Resilient Distributed Datasets for Multicluster Computing on Apache Spark
Master's === National Cheng Kung University === Institute of Computer and Communication Engineering === 104 === Apache Spark, an in-memory cluster computing system, has grown greatly in popularity and industrial acceptance because it performs iterative computations far more efficiently than MapReduce. Spark's success lies mainly in its Resilient...
Main Authors: | Tzu-LiTai 戴資力 |
---|---|
Other Authors: | Ce-Kuen Shieh 謝錫堃, Jyh-Biau Chang 張志標 |
Format: | Others |
Language: | en_US |
Published: | 2016 |
Online Access: | http://ndltd.ncl.edu.tw/handle/08549002270185219007 |
id |
ndltd-TW-104NCKU5652012 |
---|---|
record_format |
oai_dc |
spelling |
ndltd-TW-104NCKU5652012 2017-10-15T04:37:06Z http://ndltd.ncl.edu.tw/handle/08549002270185219007 FedRDD: Federated Resilient Distributed Datasets for Multicluster Computing on Apache Spark 可執行於多個 Spark 上之彈性分散式資料集 Tzu-LiTai 戴資力 Master's National Cheng Kung University Institute of Computer and Communication Engineering 104 Apache Spark, an in-memory cluster computing system, has grown greatly in popularity and industrial acceptance because it performs iterative computations far more efficiently than MapReduce. Spark's success lies mainly in its Resilient Distributed Dataset (RDD) abstraction, which allows cluster application designers to flexibly express multi-level, complex workflows with RDD transformations. However, like MapReduce, Spark and the RDD abstraction were originally designed for cluster computing on a single set of tightly coupled nodes. This initial design fails to serve an emerging class of applications in which input data are generated, stored, and maintained at multiple independent datacenters. Multicluster systems have been designed to process such federated datasets efficiently. While this area of research has thoroughly explored MapReduce-based solutions, RDD-based approaches remain a new and unexplored topic. We propose Federated RDD (FedRDD), an extension of the RDD abstraction for multicluster computing. We focus on two aims: 1) user transparency and 2) optimized data aggregation. To meet these aims, we have designed FedRDD with a master-slave hierarchy in which multiple slave RDDs are maintained. Following this design, a FedRDD global transformation can be executed as regional transformations on slave RDDs while still being exposed to the user through the original interface. Moreover, building on this design, we propose a transformation sequence refactoring formulation that minimizes inter-cluster data transmission in order to optimize aggregation. As a first effort to explore RDD-based multicluster solutions, we believe this work provides valuable insight for future research.
Ce-Kuen Shieh Jyh-Biau Chang 謝錫堃 張志標 2016 Degree thesis ; thesis 37 en_US |
collection |
NDLTD |
language |
en_US |
format |
Others |
sources |
NDLTD |
description |
Master's === National Cheng Kung University === Institute of Computer and Communication Engineering === 104 === Apache Spark, an in-memory cluster computing system, has grown greatly in popularity and industrial acceptance because it performs iterative computations far more efficiently than MapReduce. Spark's success lies mainly in its Resilient Distributed Dataset (RDD) abstraction, which allows cluster application designers to flexibly express multi-level, complex workflows with RDD transformations. However, like MapReduce, Spark and the RDD abstraction were originally designed for cluster computing on a single set of tightly coupled nodes. This initial design fails to serve an emerging class of applications in which input data are generated, stored, and maintained at multiple independent datacenters. Multicluster systems have been designed to process such federated datasets efficiently. While this area of research has thoroughly explored MapReduce-based solutions, RDD-based approaches remain a new and unexplored topic. We propose Federated RDD (FedRDD), an extension of the RDD abstraction for multicluster computing. We focus on two aims: 1) user transparency and 2) optimized data aggregation. To meet these aims, we have designed FedRDD with a master-slave hierarchy in which multiple slave RDDs are maintained. Following this design, a FedRDD global transformation can be executed as regional transformations on slave RDDs while still being exposed to the user through the original interface. Moreover, building on this design, we propose a transformation sequence refactoring formulation that minimizes inter-cluster data transmission in order to optimize aggregation. As a first effort to explore RDD-based multicluster solutions, we believe this work provides valuable insight for future research.
|
author2 |
Ce-Kuen Shieh |
author_facet |
Ce-Kuen Shieh Tzu-LiTai 戴資力 |
author |
Tzu-LiTai 戴資力 |
spellingShingle |
Tzu-LiTai 戴資力 FedRDD: Federated Resilient Distributed Datasets for Multicluster Computing on Apache Spark |
author_sort |
Tzu-LiTai |
title |
FedRDD: Federated Resilient Distributed Datasets for Multicluster Computing on Apache Spark |
publishDate |
2016 |
url |
http://ndltd.ncl.edu.tw/handle/08549002270185219007 |
work_keys_str_mv |
AT tzulitai fedrddfederatedresilientdistributeddatasetsformulticlustercomputingonapachespark AT dàizīlì fedrddfederatedresilientdistributeddatasetsformulticlustercomputingonapachespark AT tzulitai kězhíxíngyúduōgèsparkshàngzhīdànxìngfēnsànshìzīliàojí AT dàizīlì kězhíxíngyúduōgèsparkshàngzhīdànxìngfēnsànshìzīliàojí |
_version_ |
1718555171378692096 |
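
The abstract describes a global transformation that runs as regional transformations on per-cluster RDDs, and a refactoring of the transformation sequence that reduces inter-cluster traffic during aggregation. The following is a minimal pure-Python sketch of that general idea, not the thesis's implementation and not Spark code. The class `ToyFedRDD`, its methods, the helper `_reduce_pairs`, the cluster names, and the "shipped records" counter are all hypothetical names introduced for illustration. It assumes that aggregation happens at a single site and that the reduce function is associative and commutative.

```python
from operator import add


def _reduce_pairs(pairs, fn):
    """Combine (key, value) pairs that share a key using fn (hypothetical helper)."""
    out = {}
    for key, value in pairs:
        out[key] = fn(out[key], value) if key in out else value
    return out


class ToyFedRDD:
    """Hypothetical stand-in for a federated dataset: one list of records per cluster."""

    def __init__(self, regions):
        # regions maps a cluster name to that cluster's local records.
        self.regions = regions

    def map(self, fn):
        # A global map is executed as an independent map inside every region.
        return ToyFedRDD(
            {name: [fn(x) for x in data] for name, data in self.regions.items()}
        )

    def reduce_by_key(self, fn, pre_aggregate=True):
        """Return (result, number of records sent to the aggregating site)."""
        shipped = 0
        incoming = []
        for data in self.regions.values():
            if pre_aggregate:
                # Refactored sequence: reduce inside the region before shipping.
                data = list(_reduce_pairs(data, fn).items())
            shipped += len(data)
            incoming.extend(data)
        # Final merge at the aggregating site.
        return _reduce_pairs(incoming, fn), shipped


# Word count over two made-up clusters.
fed = ToyFedRDD({
    "cluster_a": ["spark", "rdd", "spark", "rdd"],
    "cluster_b": ["spark", "fed", "rdd"],
})
pairs = fed.map(lambda word: (word, 1))

optimized, shipped_optimized = pairs.reduce_by_key(add)
naive, shipped_naive = pairs.reduce_by_key(add, pre_aggregate=False)

print(optimized == naive)                # True: both plans give the same counts
print(shipped_optimized, shipped_naive)  # 5 7
```

In this toy run both plans produce the same word counts, while the regionally pre-aggregated plan sends 5 records to the aggregating site instead of 7. The gap grows with the number of repeated keys inside each cluster.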