FedRDD: Federated Resilient Distributed Datasets for Multicluster Computing on Apache Spark

Bibliographic Details
Main Authors: Tzu-Li Tai, 戴資力
Other Authors: Ce-Kuen Shieh
Format: Others
Language: en_US
Published: 2016
Online Access: http://ndltd.ncl.edu.tw/handle/08549002270185219007
Description
Summary: Master's thesis === National Cheng Kung University === Institute of Computer and Communication Engineering === 104 === Apache Spark, an in-memory cluster computing system, has grown in popularity and industrial acceptance because it performs iterative computations far more efficiently than MapReduce. Spark's success lies mainly in its Resilient Distributed Dataset (RDD) abstraction, which lets cluster application designers flexibly express multi-level, complex workflows as RDD transformations. However, like MapReduce, Spark and the RDD abstraction were originally designed for cluster computing on a single set of tightly coupled nodes. This design does not serve an emerging class of applications whose input data are generated, stored, and maintained at multiple independent datacenters. Multicluster systems have been designed to process such federated datasets efficiently. While this area of research has thoroughly explored MapReduce-based solutions, RDD-based approaches remain unexplored. We propose Federated RDD (FedRDD), an extension of the RDD abstraction for multicluster computing. We focus on two aims: 1) user transparency and 2) optimized data aggregation. To meet these aims, we designed FedRDD with a master-slave hierarchy in which a master FedRDD maintains multiple slave RDDs. Under this design, a global FedRDD transformation is executed as regional transformations on the slave RDDs while the user still sees the original RDD interface. Building on this design, we also propose a transformation sequence refactoring formulation that minimizes inter-cluster data transmission in order to optimize aggregation. As a first effort to explore RDD-based solutions for multicluster systems, we believe our research provides valuable insight for further work.
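
The master-slave idea in the summary can be illustrated with a small sketch. It is not the thesis's implementation and does not use Spark: plain Python lists stand in for the slave RDDs, and the class names (`RegionalDataset`, `FederatedDataset`), the `combine_regionally` flag, and the datacenter labels are all hypothetical. It shows a global transformation fanned out as regional transformations behind one interface, and how reducing inside each region before merging lowers the number of records that must cross cluster boundaries.

```python
class RegionalDataset:
    """Stand-in for one cluster's slave RDD (here just a Python list)."""

    def __init__(self, cluster, records):
        self.cluster = cluster
        self.records = list(records)

    def map(self, fn):
        return RegionalDataset(self.cluster, [fn(r) for r in self.records])

    def reduce_by_key(self, fn):
        # Fold values that share a key, entirely inside this region.
        acc = {}
        for key, value in self.records:
            acc[key] = fn(acc[key], value) if key in acc else value
        return RegionalDataset(self.cluster, acc.items())


class FederatedDataset:
    """Hypothetical master object: same method names, one call per region."""

    def __init__(self, regions):
        self.regions = list(regions)

    def map(self, fn):
        # A global map is simply the same map applied in every region.
        return FederatedDataset([r.map(fn) for r in self.regions])

    def reduce_by_key(self, fn, combine_regionally=True):
        regions = self.regions
        if combine_regionally:
            # Reduce inside each region first so fewer records leave it.
            regions = [r.reduce_by_key(fn) for r in regions]
        # Records that would have to travel between clusters for the merge.
        shipped = sum(len(r.records) for r in regions)
        merged = RegionalDataset(
            "global", [rec for r in regions for rec in r.records]
        )
        return dict(merged.reduce_by_key(fn).records), shipped


# Word count over two made-up datacenters.
fed = FederatedDataset([
    RegionalDataset("dc-a", ["x", "y", "x", "x"]),
    RegionalDataset("dc-b", ["y", "y", "z"]),
])
pairs = fed.map(lambda word: (word, 1))


def add(a, b):
    return a + b


naive_counts, naive_shipped = pairs.reduce_by_key(add, combine_regionally=False)
opt_counts, opt_shipped = pairs.reduce_by_key(add, combine_regionally=True)
print(naive_counts == opt_counts, naive_shipped, opt_shipped)
```

Both calls produce the same counts; only the number of records shipped between regions differs. This shortcut is valid only because addition is associative and commutative. The refactoring formulation mentioned in the summary addresses sequences of transformations and is not reproduced here.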