Novel Methods of Biomarker Discovery and Predictive Modeling using Random Forest

abstract: Random forest (RF) is a popular and powerful machine learning technique. It can be used for classification, regression, and unsupervised clustering. In its original form, introduced by Leo Breiman, RF is a predictive model that generates predictions for new observations. Recent research has...

Bibliographic Details
Other Authors: Guan, Xin (Author)
Format: Doctoral Thesis
Language: English
Published: 2017
Subjects: Biostatistics; feature selection; prediction interval; predictive modeling; random forest
Online Access: http://hdl.handle.net/2286/R.I.45017
id ndltd-asu.edu-item-45017
record_format oai_dc
spelling ndltd-asu.edu-item-45017 2018-06-22T03:08:40Z Novel Methods of Biomarker Discovery and Predictive Modeling using Random Forest abstract: Random forest (RF) is a popular and powerful machine learning technique. It can be used for classification, regression, and unsupervised clustering. In its original form, introduced by Leo Breiman, RF is a predictive model that generates predictions for new observations. Recent research has proposed several RF-based methods for feature selection and for generating prediction intervals. However, these methods are limited in their applicability and accuracy. In this dissertation, RF is applied to build a predictive model for a complex dataset and is used as the basis for two novel methods, one for biomarker discovery and one for generating prediction intervals. First, a biodosimetry model is developed using RF to determine absorbed radiation dose from gene expression measured in blood samples of potentially exposed individuals. To improve its prediction accuracy, day-specific models were built to handle the day interaction effect, and a nested modeling technique was proposed. The nested models can fit these complex data, which show large variability and non-linear relationships. Second, a panel of biomarkers was selected using both a data-driven feature selection method and handpicking, taking prior knowledge and other constraints into account. To incorporate domain knowledge, a method called Know-GRRF was developed based on guided regularized RF. This method incorporates domain knowledge as a penalty term that regulates the selection of candidate features in RF. It adds flexibility to data-driven feature selection and can improve the interpretability of models. Know-GRRF showed significant improvement in cross-species prediction when cross-species correlation was used to guide the selection of biomarkers. The method is also competitive with existing methods on simulated datasets when intrinsic data characteristics are used as an alternative to domain knowledge. Lastly, a novel non-parametric method, RFerr, was developed to generate prediction intervals using RF regression. This method is applicable to any predictive model and was shown to have better coverage and precision than existing methods on the real-world radiation dataset as well as on benchmark and simulated datasets.
Dissertation/Thesis Guan, Xin (Author) Liu, Li (Advisor) Runger, George (Advisor) Dinu, Valentin (Committee member) Arizona State University (Publisher) Biostatistics feature selection prediction interval predictive modeling random forest eng 119 pages Doctoral Dissertation Biomedical Informatics 2017 Doctoral Dissertation http://hdl.handle.net/2286/R.I.45017 http://rightsstatements.org/vocab/InC/1.0/ All Rights Reserved 2017
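The abstract describes the non-parametric prediction interval only at a high level, and this record does not give the details of RFerr. As a rough illustration of the general idea, and not the dissertation's actual method, the Python sketch below builds an interval by shifting a point prediction by empirical quantiles of held-out prediction errors (for a random forest, out-of-bag errors would be one natural source). The function names `empirical_quantile` and `residual_interval` and the parameter `alpha` are hypothetical and chosen here for illustration.

```python
def empirical_quantile(sorted_vals, q):
    """Linearly interpolated quantile of an already sorted list, 0 <= q <= 1."""
    if not sorted_vals:
        raise ValueError("need at least one value")
    pos = q * (len(sorted_vals) - 1)      # fractional index into the list
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac


def residual_interval(point_prediction, held_out_errors, alpha=0.1):
    """Return (lower, upper) for a nominal (1 - alpha) prediction interval.

    held_out_errors: observed minus predicted values on data the model did
    not see during fitting. Assumption: these errors are representative of
    the error on the new observation. No distributional form is assumed.
    """
    errs = sorted(held_out_errors)
    lower = point_prediction + empirical_quantile(errs, alpha / 2)
    upper = point_prediction + empirical_quantile(errs, 1 - alpha / 2)
    return lower, upper


# Toy usage with made-up errors, only to show the call shape.
toy_errors = [-2.0, -1.0, 0.0, 1.0, 2.0]
toy_interval = residual_interval(10.0, toy_errors, alpha=0.5)
```

Because the interval depends only on a point prediction and a set of held-out errors, the same construction works with any predictive model, which matches the applicability claim in the abstract in spirit. How well it covers depends on how representative the held-out errors are.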
collection NDLTD
language English
format Doctoral Thesis
sources NDLTD
topic Biostatistics
feature selection
prediction interval
predictive modeling
random forest
author2 Guan, Xin (Author)
title Novel Methods of Biomarker Discovery and Predictive Modeling using Random Forest
publishDate 2017
url http://hdl.handle.net/2286/R.I.45017
_version_ 1718701534926077952