Rare-class classification using ensembles of subsets of variables

An ensemble of classifiers is proposed for predictive ranking of the observations in a dataset so that the rare class observations are found in the top of the ranked list. Four drug-discovery bioassay datasets, containing a few active and majority inactive chemical compounds, are used in this thesis...

Full description

Bibliographic Details
Main Author: Tomal, Jabed H.
Language:English
Published: University of British Columbia 2013
Online Access:http://hdl.handle.net/2429/44981
id ndltd-UBC-oai-circle.library.ubc.ca-2429-44981
record_format oai_dc
spelling ndltd-UBC-oai-circle.library.ubc.ca-2429-449812018-01-05T17:26:49Z Rare-class classification using ensembles of subsets of variables Tomal, Jabed H. An ensemble of classifiers is proposed for predictive ranking of the observations in a dataset so that the rare class observations are found in the top of the ranked list. Four drug-discovery bioassay datasets, containing a few active and majority inactive chemical compounds, are used in this thesis. The compounds' activity status serves as the response variable while a set of descriptors, describing the structures of chemical compounds, serve as predictors. Five separate descriptor sets are used in each assay. The proposed ensemble aggregates over the descriptor sets by averaging probabilities of activity from random forests applied to the five descriptor sets. The resulting ensemble ensures better predictive ranking than the most accurate random forest applied to a single descriptor set. Motivated from the results of the ensemble of descriptor sets, an algorithm is developed to uncover data-adaptive subsets of variables (we call phalanxes) in a variable rich descriptor set. Capitalizing on the richness of variables, the algorithm looks for the sets of predictors that work well together in a classifier. The data-adaptive phalanxes are so formed that they help each other while forming an ensemble. The phalanxes are aggregated by averaging probabilities of activity from random forests applied to the phalanxes. The ensemble of phalanxes (EPX) outperforms random forests and regularized random forests in terms of predictive ranking. In general, EPX performs very well in a descriptor set with many variables, and in a bioassay containing a few active compounds. The phalanxes are also aggregated within and across the descriptor sets. In all of the four bioassays, the resulting ensemble outperforms the ensemble of descriptor sets, and random forests applied to the pool of the five descriptor sets. The ensemble of phalanxes is also adapted to a logistic regression model and applied to the protein homology dataset downloaded from the KDD Cup 2004 competition. The ensembles are applied to a real test set. The adapted version of the ensemble is found more powerful in terms of predictive ranking and less computationally demanding than the original ensemble of phalanxes with random forests. Science, Faculty of Statistics, Department of Graduate 2013-08-30T22:45:38Z 2014-02-28T00:00:00Z 2013 2013-11 Text Thesis/Dissertation http://hdl.handle.net/2429/44981 eng Attribution-NonCommercial-NoDerivatives 4.0 International http://creativecommons.org/licenses/by-nc-nd/4.0/ University of British Columbia
collection NDLTD
language English
sources NDLTD
description An ensemble of classifiers is proposed for predictive ranking of the observations in a dataset so that the rare class observations are found in the top of the ranked list. Four drug-discovery bioassay datasets, containing a few active and majority inactive chemical compounds, are used in this thesis. The compounds' activity status serves as the response variable while a set of descriptors, describing the structures of chemical compounds, serve as predictors. Five separate descriptor sets are used in each assay. The proposed ensemble aggregates over the descriptor sets by averaging probabilities of activity from random forests applied to the five descriptor sets. The resulting ensemble ensures better predictive ranking than the most accurate random forest applied to a single descriptor set. Motivated from the results of the ensemble of descriptor sets, an algorithm is developed to uncover data-adaptive subsets of variables (we call phalanxes) in a variable rich descriptor set. Capitalizing on the richness of variables, the algorithm looks for the sets of predictors that work well together in a classifier. The data-adaptive phalanxes are so formed that they help each other while forming an ensemble. The phalanxes are aggregated by averaging probabilities of activity from random forests applied to the phalanxes. The ensemble of phalanxes (EPX) outperforms random forests and regularized random forests in terms of predictive ranking. In general, EPX performs very well in a descriptor set with many variables, and in a bioassay containing a few active compounds. The phalanxes are also aggregated within and across the descriptor sets. In all of the four bioassays, the resulting ensemble outperforms the ensemble of descriptor sets, and random forests applied to the pool of the five descriptor sets. The ensemble of phalanxes is also adapted to a logistic regression model and applied to the protein homology dataset downloaded from the KDD Cup 2004 competition. The ensembles are applied to a real test set. The adapted version of the ensemble is found more powerful in terms of predictive ranking and less computationally demanding than the original ensemble of phalanxes with random forests. === Science, Faculty of === Statistics, Department of === Graduate
author Tomal, Jabed H.
spellingShingle Tomal, Jabed H.
Rare-class classification using ensembles of subsets of variables
author_facet Tomal, Jabed H.
author_sort Tomal, Jabed H.
title Rare-class classification using ensembles of subsets of variables
title_short Rare-class classification using ensembles of subsets of variables
title_full Rare-class classification using ensembles of subsets of variables
title_fullStr Rare-class classification using ensembles of subsets of variables
title_full_unstemmed Rare-class classification using ensembles of subsets of variables
title_sort rare-class classification using ensembles of subsets of variables
publisher University of British Columbia
publishDate 2013
url http://hdl.handle.net/2429/44981
work_keys_str_mv AT tomaljabedh rareclassclassificationusingensemblesofsubsetsofvariables
_version_ 1718583965812523008