An empirical study of practical, theoretical and online variants of random forests

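Random forests are ensembles of randomized decision trees where diversity is created by injecting randomness into the fitting of each tree. The combination of their accuracy and their simplicity has resulted in their adoption in many applications. Different variants have been developed with different goals in mind: improving predictive accuracy, extending the range of application to online and structure domains, and introducing simplifications for theoretical amenability.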


Bibliographic Details
Main Author: Matheson, David
Language: English
Published: University of British Columbia 2014
Online Access: http://hdl.handle.net/2429/46586
id ndltd-UBC-oai-circle.library.ubc.ca-2429-46586
record_format oai_dc
spelling ndltd-UBC-oai-circle.library.ubc.ca-2429-46586 2018-01-05T17:27:22Z Science, Faculty of; Computer Science, Department of; Graduate 2014-04-25T15:59:06Z 2014-04-25T15:59:06Z 2014 2014-09 Text Thesis/Dissertation eng Attribution-NoDerivs 2.5 Canada http://creativecommons.org/licenses/by-nd/2.5/ca/
collection NDLTD
language English
sources NDLTD
description Random forests are ensembles of randomized decision trees where diversity is created by injecting randomness into the fitting of each tree. The combination of their accuracy and their simplicity has resulted in their adoption in many applications. Different variants have been developed with different goals in mind: improving predictive accuracy, extending the range of application to online and structure domains, and introducing simplifications for theoretical amenability. While there are many subtle differences among the variants, the core difference is the method of selecting candidate split points. In our work, we examine eight different strategies for selecting candidate split points and study their effect on predictive accuracy, individual strength, diversity, computation time and model complexity. We also examine the effect of different parameter settings and several other design choices including bagging, subsampling data points at each node, taking linear combinations of features, splitting data points into structure and estimation streams and using a fixed frontier for online variants. Our empirical study finds several trends, some of which are in contrast to commonly held beliefs, that have value to practitioners and theoreticians. For variants used by practitioners the most important discoveries include: bagging almost never improves predictive accuracy, selecting candidate split points at all midpoints can achieve lower error than selecting them uniformly at random, and subsampling data points at each node decreases training time without affecting predictive accuracy. We also show that the gap between variants with proofs of consistency and those used in practice can be accounted for by the requirement to split data points into structure and estimation streams. Our work with online forests demonstrates the potential improvement that is possible by selecting candidate split points at data points, constraining memory with a fixed frontier and training with multiple passes through the data.
Science, Faculty of; Computer Science, Department of; Graduate