Distributed computing system and big data real-time processing structure—based on YARN, Storm and Spark

Bibliographic Details
Main Authors: Tseng, Po Wei, 曾柏崴
Other Authors: Liou, Wen Ching
Format: Others
Language: en_US
Online Access: http://ndltd.ncl.edu.tw/handle/5q9482
Description
Summary: Master's === National Chengchi University === Graduate Institute of Information Management === 103 === With the arrival of the big data era, both the immediacy and the volume of data computation pose many challenges. For example, in futures market forecasting, the market state must be forecast accurately, within tens of milliseconds, using a model built from large data sets (hundreds of GB to tens of TB). In this research, we introduce a real-time big data computing architecture that addresses both the demand for high-speed processing and the demand for processing immense volumes of data. In addition, several algorithms, such as SVM (Support Vector Machine) and LR (Logistic Regression), are implemented as a subproject on the parallel distributed computing system. The architecture involves three main cloud computing techniques: 1. Apache YARN serves as an integrated resource management system so that cluster resources are used more efficiently. 2. To meet the demand for high-speed processing, Apache Storm processes large real-time data streams and computes thousands of numerical values within tens of milliseconds for subsequent model building. 3. With Apache Spark, we establish a distributed computing architecture for model building. By using Spark RDDs (Resilient Distributed Datasets), this architecture shortens the execution time of SVM and LR model building to within hundreds of milliseconds. To meet the requirements of the distributed system, we design an n-tier distributed architecture that integrates the foregoing techniques. In this architecture, we use Apache Kafka as the messaging middleware to support asynchronous message-based communication.
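
Illustrative note: the summary describes LR model building over partitioned data (Spark RDDs) but gives no code. The sketch below is not the thesis's implementation and does not use Spark. It is a minimal pure-Python illustration, under our own assumptions, of the map/reduce pattern such a design typically relies on: each partition computes a partial gradient (map), and the partial gradients are summed (reduce) before the weights are updated. All function names, the toy data, the learning rate, and the iteration count are hypothetical.

```python
import math
from functools import reduce


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def partition_gradient(partition, weights):
    # "Map" step: sum the log-loss gradient over one partition of
    # (features, label) records. In a cluster this would run per worker.
    grad = [0.0] * len(weights)
    for features, label in partition:
        score = sum(w * x for w, x in zip(weights, features))
        error = sigmoid(score) - label
        for i, x in enumerate(features):
            grad[i] += error * x
    return grad


def add_vectors(a, b):
    # "Reduce" step: element-wise sum of two partial gradients.
    return [x + y for x, y in zip(a, b)]


def train_logistic_regression(partitions, num_features,
                              learning_rate=0.5, iterations=200):
    # Batch gradient descent; the hyperparameters are assumed values.
    n = sum(len(p) for p in partitions)
    weights = [0.0] * num_features
    for _ in range(iterations):
        partials = [partition_gradient(p, weights) for p in partitions]
        total = reduce(add_vectors, partials)
        weights = [w - learning_rate * g / n for w, g in zip(weights, total)]
    return weights


def predict(weights, features):
    score = sum(w * x for w, x in zip(weights, features))
    return 1 if sigmoid(score) >= 0.5 else 0


# Hypothetical toy data split into two partitions. The first feature is a
# constant bias term; the label is 1 when the second feature is positive.
partitions = [
    [([1.0, -2.0], 0), ([1.0, -1.0], 0)],
    [([1.0, 1.0], 1), ([1.0, 2.0], 1)],
]
weights = train_logistic_regression(partitions, num_features=2)
```

Calling `predict(weights, [1.0, 3.0])` classifies a new point with the trained weights. In an RDD-based system the list comprehension and `reduce` would be replaced by the framework's distributed map and reduce operations, which is where keeping the data cached in memory across iterations would be expected to save time.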