Big Data Analytics Platform Establishment: Efficiency Analysis and Spam Email Filtering


Bibliographic Details
Main Authors: Hsiang-Erh Lai, 賴向二
Other Authors: Chia-Hung Yeh
Format: Others
Language: en_US
Published: 2015
Online Access: http://ndltd.ncl.edu.tw/handle/8zzjgm
Description
Summary: Master's === National Sun Yat-sen University === Department of Electrical Engineering === 104 === The era of big data has arrived. Beyond its benefits, big data poses new challenges for data analysis, and the Hadoop platform was proposed to address them. Hadoop uses MapReduce and HDFS to analyze and store big data efficiently on a cluster of commodity computers. In this thesis, we implemented a Hadoop big data platform and carried out several efficiency analyses. In addition, we implemented a spam filtering system on the Hadoop platform. The system uses term frequency-inverse document frequency (TF-IDF) to extract keyword features from emails and a naive Bayes classifier to classify each email as spam or non-spam. In the experiments, we compared our system with SpamAssassin, a robust spam filter widely used on Linux. The experimental results show that our system detects most of the spam that SpamAssassin detects and almost never misclassifies non-spam emails as spam. Most importantly, it runs 10 times faster than SpamAssassin.
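
The TF-IDF plus naive Bayes pipeline named in the summary can be illustrated with a minimal single-machine sketch. This is not the thesis implementation: it does not use Hadoop or MapReduce, the function names (`fit_idf`, `tfidf`, `train_nb`, `classify`) are hypothetical, the smoothed IDF formula and Laplace smoothing are assumptions, and the four training emails are invented toy data.

```python
import math
from collections import Counter

def tokenize(text):
    # Assumption: whitespace tokenization on lowercased text.
    return text.lower().split()

def fit_idf(docs):
    """Smoothed inverse document frequency for every term in the tokenized docs."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}

def tfidf(doc, idf):
    """TF-IDF weights for one tokenized doc; terms unseen in training are dropped."""
    counts = Counter(t for t in doc if t in idf)
    total = sum(counts.values()) or 1
    return {t: (c / total) * idf[t] for t, c in counts.items()}

def train_nb(vectors, labels, vocab, alpha=1.0):
    """Multinomial naive Bayes over TF-IDF weights with Laplace smoothing."""
    class_count = Counter(labels)
    weight = {y: {t: 0.0 for t in vocab} for y in class_count}
    for vec, y in zip(vectors, labels):
        for t, w in vec.items():
            weight[y][t] += w
    model = {}
    for y, n_y in class_count.items():
        total = sum(weight[y].values()) + alpha * len(vocab)
        log_like = {t: math.log((weight[y][t] + alpha) / total) for t in vocab}
        model[y] = (math.log(n_y / len(labels)), log_like)
    return model

def predict(model, vec):
    """Return the class with the highest log-posterior score."""
    scores = {
        y: prior + sum(w * log_like[t] for t, w in vec.items())
        for y, (prior, log_like) in model.items()
    }
    return max(scores, key=scores.get)

# Invented toy training set, for illustration only.
TRAIN = [
    ("win free money now", "spam"),
    ("free prize claim now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("project report attached for review", "ham"),
]

docs = [tokenize(text) for text, _ in TRAIN]
labels = [label for _, label in TRAIN]
idf = fit_idf(docs)
model = train_nb([tfidf(d, idf) for d in docs], labels, vocab=list(idf))

def classify(text):
    return predict(model, tfidf(tokenize(text), idf))

if __name__ == "__main__":
    print(classify("claim free money"))
    print(classify("monday project meeting"))
```

In a MapReduce setting, the term counts and document frequencies would typically be computed by distributed map and reduce steps rather than in-memory counters; the sketch only shows the feature extraction and classification logic.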