Parallel Bayesian Additive Regression Trees, using Apache Spark

Bibliographic Details
Main Author: Geirsson, Sigurdur
Format: Others
Language: English
Published: Uppsala universitet, Institutionen för informationsteknologi 2017
Online Access: http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-322247
Description
Summary: New methods have been developed to find patterns and trends in order to gain knowledge from large datasets in various disciplines, such as bioinformatics, consumer behavior in advertising, and weather forecasting. The goal of many of these new methods is to construct prediction models from the data. Linear regression, which is widely used for analyzing data, is very powerful for detecting simple patterns, but higher complexity requires a more sophisticated solution. Regression trees split up the problem into numerous parts, but they do not generalize well as they tend to have high variance. Ensemble methods, which combine a collection of regression trees, address that problem by spreading the model over numerous trees. Ensemble methods such as Random Forest, Gradient Boosted Trees, and Bayesian Additive Regression Trees all have different ways of constructing a prediction model from data. Using these models on large datasets is computationally demanding. The aim of this work is to explore a parallel implementation of Bayesian Additive Regression Trees (BART) using the Apache Spark framework. Spark is well suited to this case, as it is designed for iterative and data-intensive jobs. We show that our parallel implementation is about 35 times faster for a dataset of pig genomes. Most of the speed improvement is due to serial code modifications that minimize scanning of the data. The gain from parallelization is a speedup of 2.2x, obtained by using four cores on a quad-core system. Measurements on a computer cluster consisting of four computers resulted in a maximum speedup of 2.1x for eight cores. We should emphasize that these gains are heavily dependent on the size of the datasets.
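To put the reported speedups in perspective, the following minimal Python sketch computes parallel efficiency (speedup divided by core count) for the two configurations mentioned in the summary. The function names are hypothetical and not taken from the thesis. The last helper assumes that the overall 35x figure is the product of the serial-code gain and the 2.2x four-core parallel gain; the summary does not state this, so treat that result as a rough illustration only.

```python
def parallel_efficiency(speedup: float, cores: int) -> float:
    """Fraction of ideal linear speedup achieved: speedup / cores."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return speedup / cores


def implied_serial_gain(total_speedup: float, parallel_speedup: float) -> float:
    """Gain left over once the parallel speedup is factored out.

    ASSUMPTION: total = serial gain * parallel gain. The summary does
    not confirm that the two factors combine multiplicatively.
    """
    if parallel_speedup <= 0:
        raise ValueError("parallel_speedup must be positive")
    return total_speedup / parallel_speedup


if __name__ == "__main__":
    # Figures quoted in the summary: 2.2x on 4 cores, 2.1x on 8 cores.
    for label, speedup, cores in [("quad-core system", 2.2, 4),
                                  ("four-computer cluster", 2.1, 8)]:
        eff = parallel_efficiency(speedup, cores)
        print(f"{label}: efficiency = {eff:.2f}")

    # Rough estimate only; relies on the multiplicative assumption above.
    print(f"implied serial gain = {implied_serial_gain(35.0, 2.2):.1f}x")
```

Under these assumptions, efficiency is roughly 0.55 on four cores and roughly 0.26 on eight, which is consistent with the summary's remark that most of the improvement comes from the serial code changes rather than from parallelization.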