FedRDD: Federated Resilient Distributed Datasets for Multicluster Computing on Apache Spark

Master's thesis === National Cheng Kung University === Institute of Computer and Communication Engineering === 104 === Apache Spark, an in-memory cluster system, has grown considerably in popularity and industrial acceptance because it performs iterative computations far more efficiently than MapReduce. Spark's success lies mainly in its Resilient...

Bibliographic Details
Main Authors:	Tzu-Li Tai (戴資力)
Other Authors:	Ce-Kuen Shieh
Format:	Others
Language:	en_US
Published:	2016
Online Access:	http://ndltd.ncl.edu.tw/handle/08549002270185219007

id	ndltd-TW-104NCKU5652012
record_format	oai_dc
spelling	ndltd-TW-104NCKU5652012 2017-10-15T04:37:06Z http://ndltd.ncl.edu.tw/handle/08549002270185219007 FedRDD: Federated Resilient Distributed Datasets for Multicluster Computing on Apache Spark (Chinese title: 可執行於多個 Spark 上之彈性分散式資料集, "Resilient Distributed Datasets Executable on Multiple Spark Clusters"); Tzu-Li Tai (戴資力); Master's thesis, National Cheng Kung University, Institute of Computer and Communication Engineering, 104; Ce-Kuen Shieh (謝錫堃), Jyh-Biau Chang (張志標); 2016; thesis; 37; en_US
collection	NDLTD
language	en_US
format	Others
sources	NDLTD
description	Master's thesis === National Cheng Kung University === Institute of Computer and Communication Engineering === 104 === Apache Spark, an in-memory cluster system, has grown considerably in popularity and industrial acceptance because it performs iterative computations far more efficiently than MapReduce. Spark's success lies mainly in its Resilient Distributed Dataset (RDD) abstraction, which allows cluster application designers to flexibly express multi-level, complex workflows as RDD transformations. However, like MapReduce, Spark and the RDD abstraction were originally designed for cluster computing on a single set of tightly coupled nodes. This initial design fails to serve an emerging class of applications in which input data are generated, stored, and maintained at multiple independent datacenters. Multicluster systems have been designed to process such federated datasets efficiently. While this area of research has thoroughly explored MapReduce-based solutions, RDD-based approaches remain a new and unexplored topic. We propose Federated RDD (FedRDD), an extension of the RDD abstraction for multicluster computing. We focus on two aims: 1) user transparency and 2) optimized data aggregation. To meet these aims, we designed FedRDD with a master-slave hierarchy in which multiple slave RDDs are maintained. Following this design, a FedRDD global transformation can be executed as regional transformations on slave RDDs while still being exposed to the user through the original interface. Moreover, a transformation sequence refactoring formulation based on this design is proposed to minimize inter-cluster data transmission and thereby optimize aggregation. We believe that this research, as a first effort to explore RDD-based multicluster solutions, provides valuable insight for further research.
author2	Ce-Kuen Shieh
author_facet	Ce-Kuen Shieh; Tzu-Li Tai (戴資力)
author	Tzu-Li Tai (戴資力)
author_sort	Tzu-Li Tai
title	FedRDD: Federated Resilient Distributed Datasets for Multicluster Computing on Apache Spark
title_sort	fedrdd: federated resilient distributed datasets for multicluster computing on apache spark
publishDate	2016
url	http://ndltd.ncl.edu.tw/handle/08549002270185219007

Similar Items