Optimization Study of Applying Apache Spark on Plain Text Big Data with Association Rules Operations


Bibliographic Details
Main Author: 熊原朗
Other Authors: 賴聯福
Format: Others
Language: zh-TW
Published: 2019
Online Access: http://ndltd.ncl.edu.tw/handle/6b74rz
Description
Summary: Master's thesis === National Changhua University of Education === Department of Computer Science and Information Engineering === Academic year 107 === The volume of plain text generated by people on the Internet keeps growing, and Internet service providers use this data to build competitive systems that deliver more relevant services. Among big data computing frameworks, Apache Spark is commonly used to process plain text data and to build recommendation systems with collaborative filtering. When processing data with Spark, however, developers may implement the same text operation with different APIs, and that choice has a considerable impact on performance and efficiency. Moreover, many researchers and medium-sized enterprises run small-scale clusters, while most research on Spark parameter tuning targets large-scale clusters; on small clusters, the interactions between parameters and node performance differ. This thesis presents a performance optimization study of small-scale cluster deployments in the context of applying Spark to association rule operations over plain text big data. By comparing different APIs and operating parameters, it compensates for the limited computational power of small-scale clusters and achieves the highest efficiency in a constrained environment. With the improved implementation proposed in this thesis, execution speed increases by up to 3.44 times, and jobs complete even when the output data exceeds three times the available memory of a single node. Experiments simulating small-cluster load further show that the highest computing performance is obtained by using Kryo serialization, setting the recommended level of parallelism, and letting Spark allocate core resources itself rather than allocating them manually.
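
To make the workload concrete, the sketch below mines association rules from plain text using Spark MLlib's FPGrowth. The abstract does not name the exact API, dataset, or thresholds the thesis uses, so the tokenization, the input path, and the support/confidence values here are illustrative assumptions only.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.fpm.FPGrowth

object AssociationRulesSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("PlainTextAssociationRules")
      .getOrCreate()
    import spark.implicits._

    // Treat each line of plain text as one "transaction": its set of distinct tokens.
    // (FPGrowth requires the items within a transaction to be unique.)
    val transactions = spark.read.textFile("hdfs:///data/plain_text") // hypothetical path
      .map(_.toLowerCase.split("\\s+").distinct)
      .toDF("items")

    // Mine frequent itemsets, then derive association rules from them.
    val model = new FPGrowth()
      .setItemsCol("items")
      .setMinSupport(0.01)    // placeholder threshold
      .setMinConfidence(0.5)  // placeholder threshold
      .fit(transactions)

    // Each rule row carries an antecedent, a consequent, and its confidence.
    model.associationRules.show(20, truncate = false)
    spark.stop()
  }
}
```

The tuning findings in the final sentence can likewise be expressed as Spark configuration. A minimal sketch follows, assuming that "letting Spark allocate core resources itself" refers to dynamic allocation; the parallelism value is a placeholder, since the abstract does not state the cluster's core count (the common guideline is two to three tasks per available core).

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Illustrative small-cluster tuning; the values are placeholders, not the thesis's numbers.
val conf = new SparkConf()
  .setAppName("SmallClusterAssociationRules")
  // Kryo serialization: faster and more compact than the default Java serializer.
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Recommended parallelism: roughly 2-3 tasks per core (48 assumes a 16-24 core cluster).
  .set("spark.default.parallelism", "48")
  .set("spark.sql.shuffle.partitions", "48")
  // Let Spark allocate executors/cores itself instead of pinning them manually.
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")

val spark = SparkSession.builder().config(conf).getOrCreate()
```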
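Design note: the two sketches are independent; in practice the FPGrowth job would simply be launched with the tuned configuration, so the serializer, parallelism, and allocation settings apply to the shuffle-heavy itemset-mining stages where the abstract reports the gains.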