An Asymptotic Ensemble Learning Framework for Big Data Analysis

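To enable big data analysis when the data volume exceeds the available computing resources, we propose a new method that uses only a few random sample data blocks of a big data set to obtain approximate results for the entire data set. The random sample partition (RSP) distributed data model represents a big data set as a set of non-overlapping random sample data blocks. Each block is saved as an RSP data block file that can be used directly to estimate the statistical properties of the entire data set. A subset of RSP data blocks is randomly selected, and the selected blocks are analyzed in parallel with existing sequential algorithms. The results from these blocks are then combined to obtain ensemble estimates and models, which can be improved gradually by appending results from newly analyzed RSP data blocks. To this end, we propose a distributed data-parallel framework (the Alpha framework) and develop a prototype using Microsoft R Server packages and the Hadoop Distributed File System. Experimental results on three real data sets show that a subset of the RSP data blocks of a data set is sufficient to obtain estimates and models equivalent to those computed from the entire data set.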
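
The block-level sampling idea summarized in the abstract can be illustrated with a small sketch. This is not the authors' Alpha framework, which the record describes as built on Microsoft R Server packages and the Hadoop distributed file system. It is a single-machine Python illustration under simplifying assumptions: the data fit in memory, the statistic of interest is the mean, and blocks are combined by a plain average of block-level means. The function names `make_rsp_blocks` and `asymptotic_mean`, the batch size, and the stopping tolerance are all hypothetical and chosen only for this example.

```python
import random
import statistics


def make_rsp_blocks(data, num_blocks, seed=0):
    """Shuffle the records and cut them into non-overlapping blocks.

    After a uniform shuffle, each block is a random sample of the whole
    data set (assumption: this stands in for an RSP partitioning step).
    """
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    # Strided slicing yields num_blocks disjoint blocks of near-equal size.
    return [shuffled[i::num_blocks] for i in range(num_blocks)]


def asymptotic_mean(blocks, batch_size=2, tol=1e-3, seed=0):
    """Analyze randomly chosen blocks batch by batch until the estimate settles.

    Returns the ensemble estimate and the number of blocks that were used.
    """
    rng = random.Random(seed)
    # Random block selection without replacement.
    order = rng.sample(range(len(blocks)), len(blocks))
    block_means = []
    previous = None
    estimate = None
    for start in range(0, len(order), batch_size):
        batch = order[start:start + batch_size]
        # Each block is analyzed independently; a cluster would do this in parallel.
        block_means.extend(statistics.fmean(blocks[i]) for i in batch)
        # Ensemble estimate: simple average of all block-level results so far.
        estimate = statistics.fmean(block_means)
        if previous is not None and abs(estimate - previous) < tol:
            break  # appending new block results no longer changes the estimate much
        previous = estimate
    return estimate, len(block_means)


if __name__ == "__main__":
    rng = random.Random(42)
    data = [rng.gauss(50.0, 10.0) for _ in range(100_000)]
    blocks = make_rsp_blocks(data, num_blocks=100)
    estimate, used = asymptotic_mean(blocks, batch_size=5, tol=0.05)
    print(f"ensemble mean from {used} of {len(blocks)} blocks: {estimate:.2f}")
    print(f"full-data mean: {statistics.fmean(data):.2f}")
```

The stopping rule here (change between successive estimates below a tolerance) is only one possible choice; the record does not state which criterion the paper uses.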


Bibliographic Details
Main Authors:	Salman Salloum, Joshua Zhexue Huang, Yulin He, Xiaojun Chen
Format:	Article
Language:	English
Published:	IEEE 2019-01-01
Series:	IEEE Access
Subjects:	Big data analysis; cluster computing; random sample partition; block-level sampling; distributed and parallel computing; approximate computing
Online Access:	https://ieeexplore.ieee.org/document/8586790/

id	doaj-00afc742ec3445af88051cfff772ccd4
record_format	Article
timestamp	2021-03-29T22:09:30Z
volume	7
pages	3675-3693
doi	10.1109/ACCESS.2018.2889355
document_number	8586790
orcid	Salman Salloum: https://orcid.org/0000-0002-6750-003X
affiliation	National Engineering Laboratory for Big Data System Computing Technology, Shenzhen University, Shenzhen, China (all four authors)
collection	DOAJ
language	English
format	Article
sources	DOAJ
author	Salman Salloum, Joshua Zhexue Huang, Yulin He, Xiaojun Chen
title	An Asymptotic Ensemble Learning Framework for Big Data Analysis
publisher	IEEE
series	IEEE Access
issn	2169-3536
publishDate	2019-01-01
description	To enable big data analysis when the data volume exceeds the available computing resources, we propose a new method that uses only a few random sample data blocks of a big data set to obtain approximate results for the entire data set. The random sample partition (RSP) distributed data model represents a big data set as a set of non-overlapping random sample data blocks. Each block is saved as an RSP data block file that can be used directly to estimate the statistical properties of the entire data set. A subset of RSP data blocks is randomly selected, and the selected blocks are analyzed in parallel with existing sequential algorithms. The results from these blocks are then combined to obtain ensemble estimates and models, which can be improved gradually by appending results from newly analyzed RSP data blocks. To this end, we propose a distributed data-parallel framework (the Alpha framework) and develop a prototype using Microsoft R Server packages and the Hadoop Distributed File System. Experimental results on three real data sets show that a subset of the RSP data blocks of a data set is sufficient to obtain estimates and models equivalent to those computed from the entire data set.
topic	Big data analysis; cluster computing; random sample partition; block-level sampling; distributed and parallel computing; approximate computing
url	https://ieeexplore.ieee.org/document/8586790/
