FedRDD: Federated Resilient Distributed Datasets for Multicluster Computing on Apache Spark

Master's === National Cheng Kung University === Institute of Computer and Communication Engineering === 104 === Apache Spark, an in-memory cluster system, has grown considerably in popularity and industrial acceptance because it performs iterative computations far more efficiently than MapReduce. Spark's success lies mainly in its Resilient...

Bibliographic Details
Main Authors: Tzu-Li Tai, 戴資力
Other Authors: Ce-Kuen Shieh
Format: Others
Language: en_US
Published: 2016
Online Access: http://ndltd.ncl.edu.tw/handle/08549002270185219007
id ndltd-TW-104NCKU5652012
record_format oai_dc
spelling ndltd-TW-104NCKU5652012 2017-10-15T04:37:06Z http://ndltd.ncl.edu.tw/handle/08549002270185219007 FedRDD: Federated Resilient Distributed Datasets for Multicluster Computing on Apache Spark 可執行於多個 Spark 上之彈性分散式資料集 Tzu-Li Tai 戴資力 Master's National Cheng Kung University Institute of Computer and Communication Engineering 104 Apache Spark, an in-memory cluster system, has grown considerably in popularity and industrial acceptance because it performs iterative computations far more efficiently than MapReduce. Spark's success lies mainly in its Resilient Distributed Dataset (RDD) abstraction, which allows cluster application designers to flexibly express multi-level, complex workflows with RDD transformations. However, like MapReduce, Spark and the RDD abstraction were originally designed for cluster computing on a single set of tightly coupled nodes. This initial design fails to serve an emerging class of applications in which input data are generated, stored, and maintained at multiple independent datacenters. Multicluster systems have been designed to process such federated datasets efficiently. While this area of research has thoroughly explored MapReduce-based solutions, RDD-based approaches remain unexplored. We propose Federated RDD (FedRDD), an extension to the RDD abstraction for multicluster computing. We focus on two aims: 1) user transparency and 2) optimized data aggregation. To meet these aims, we have designed FedRDD with a master-slave hierarchy in which multiple slave RDDs are maintained. Following this design, a FedRDD global transformation can be executed as regional transformations on slave RDDs while still being exposed to the user through the original interface. Building on this design, we also propose a formulation for refactoring transformation sequences that minimizes inter-cluster data transmission, thereby optimizing aggregation.
We believe that this work, as a first effort to explore RDD-based multicluster system solutions, provides valuable insight for further research. Ce-Kuen Shieh Jyh-Biau Chang 謝錫堃 張志標 2016 學位論文 ; thesis 37 en_US
collection NDLTD
language en_US
format Others
sources NDLTD
description Master's === National Cheng Kung University === Institute of Computer and Communication Engineering === 104 === Apache Spark, an in-memory cluster system, has grown considerably in popularity and industrial acceptance because it performs iterative computations far more efficiently than MapReduce. Spark's success lies mainly in its Resilient Distributed Dataset (RDD) abstraction, which allows cluster application designers to flexibly express multi-level, complex workflows with RDD transformations. However, like MapReduce, Spark and the RDD abstraction were originally designed for cluster computing on a single set of tightly coupled nodes. This initial design fails to serve an emerging class of applications in which input data are generated, stored, and maintained at multiple independent datacenters. Multicluster systems have been designed to process such federated datasets efficiently. While this area of research has thoroughly explored MapReduce-based solutions, RDD-based approaches remain unexplored. We propose Federated RDD (FedRDD), an extension to the RDD abstraction for multicluster computing. We focus on two aims: 1) user transparency and 2) optimized data aggregation. To meet these aims, we have designed FedRDD with a master-slave hierarchy in which multiple slave RDDs are maintained. Following this design, a FedRDD global transformation can be executed as regional transformations on slave RDDs while still being exposed to the user through the original interface. Building on this design, we also propose a formulation for refactoring transformation sequences that minimizes inter-cluster data transmission, thereby optimizing aggregation. We believe that this work, as a first effort to explore RDD-based multicluster system solutions, provides valuable insight for further research.
author2 Ce-Kuen Shieh
author_facet Ce-Kuen Shieh
Tzu-Li Tai
戴資力
author Tzu-Li Tai
戴資力
spellingShingle Tzu-Li Tai
戴資力
FedRDD: Federated Resilient Distributed Datasets for Multicluster Computing on Apache Spark
author_sort Tzu-Li Tai
title FedRDD: Federated Resilient Distributed Datasets for Multicluster Computing on Apache Spark
title_short FedRDD: Federated Resilient Distributed Datasets for Multicluster Computing on Apache Spark
title_full FedRDD: Federated Resilient Distributed Datasets for Multicluster Computing on Apache Spark
title_fullStr FedRDD: Federated Resilient Distributed Datasets for Multicluster Computing on Apache Spark
title_full_unstemmed FedRDD: Federated Resilient Distributed Datasets for Multicluster Computing on Apache Spark
title_sort fedrdd: federated resilient distributed datasets for multicluster computing on apache spark
publishDate 2016
url http://ndltd.ncl.edu.tw/handle/08549002270185219007
work_keys_str_mv AT tzulitai fedrddfederatedresilientdistributeddatasetsformulticlustercomputingonapachespark
AT dàizīlì fedrddfederatedresilientdistributeddatasetsformulticlustercomputingonapachespark
AT tzulitai kězhíxíngyúduōgèsparkshàngzhīdànxìngfēnsànshìzīliàojí
AT dàizīlì kězhíxíngyúduōgèsparkshàngzhīdànxìngfēnsànshìzīliàojí
_version_ 1718555171378692096