An empirical study of practical, theoretical and online variants of random forests

Random forests are ensembles of randomized decision trees where diversity is created by injecting randomness into the fitting of each tree. The combination of their accuracy and their simplicity has resulted in their adoption in many applications. Different variants have been developed with differe...


Bibliographic Details
Main Author:	Matheson, David
Language:	English
Published:	University of British Columbia 2014
Online Access:	http://hdl.handle.net/2429/46586

id	ndltd-UBC-oai-circle.library.ubc.ca-2429-46586
record_format	oai_dc
spelling	ndltd-UBC-oai-circle.library.ubc.ca-2429-46586 2018-01-05T17:27:22Z An empirical study of practical, theoretical and online variants of random forests Matheson, David Science, Faculty of; Computer Science, Department of; Graduate 2014-04-25T15:59:06Z 2014-04-25T15:59:06Z 2014 2014-09 Text Thesis/Dissertation http://hdl.handle.net/2429/46586 eng Attribution-NoDerivs 2.5 Canada http://creativecommons.org/licenses/by-nd/2.5/ca/ University of British Columbia
collection	NDLTD
language	English
sources	NDLTD
description	Random forests are ensembles of randomized decision trees where diversity is created by injecting randomness into the fitting of each tree. The combination of their accuracy and their simplicity has resulted in their adoption in many applications. Different variants have been developed with different goals in mind: improving predictive accuracy, extending the range of application to online and structured domains, and introducing simplifications for theoretical amenability. While there are many subtle differences among the variants, the core difference is the method of selecting candidate split points. In our work, we examine eight different strategies for selecting candidate split points and study their effect on predictive accuracy, individual strength, diversity, computation time and model complexity. We also examine the effect of different parameter settings and several other design choices, including bagging, subsampling data points at each node, taking linear combinations of features, splitting data points into structure and estimation streams, and using a fixed frontier for online variants. Our empirical study finds several trends, some of which contrast with commonly held beliefs, that have value to practitioners and theoreticians. For variants used by practitioners, the most important discoveries include: bagging almost never improves predictive accuracy, selecting candidate split points at all midpoints can achieve lower error than selecting them uniformly at random, and subsampling data points at each node decreases training time without affecting predictive accuracy. We also show that the gap between variants with proofs of consistency and those used in practice can be accounted for by the requirement to split data points into structure and estimation streams.
Our work with online forests demonstrates the potential improvement that is possible by selecting candidate split points at data points, constraining memory with a fixed frontier, and training with multiple passes through the data.
Science, Faculty of; Computer Science, Department of; Graduate
author	Matheson, David
title	An empirical study of practical, theoretical and online variants of random forests
publisher	University of British Columbia
publishDate	2014
url	http://hdl.handle.net/2429/46586


Similar Items