Random Forest Modelling of High-Dimensional Mixed-Type Data for Breast Cancer Classification

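Advances in high-throughput technologies encourage the generation of large amounts of multiomics data to investigate complex diseases, including breast cancer. Given that the aetiologies of such diseases extend beyond a single biological entity, and that essential biological information can be carried by all data regardless of data type, integrative analyses are needed to identify clinically relevant patterns. To facilitate such analyses, we present a permutation-based framework for random forest methods which simultaneously allows the unbiased integration of mixed-type data and assessment of relative feature importance. Through simulation studies and machine learning datasets, the performance of the approach was evaluated. The results showed minimal multicollinearity and limited overfitting. To further assess the performance, the permutation-based framework was applied to high-dimensional mixed-type data from two independent breast cancer cohorts. Reproducibility and robustness of our approach were demonstrated by the concordance in relative feature importance between the cohorts, along with consistencies in clustering profiles. One of the identified clusters was shown to be prognostic for clinical outcome after standard-of-care adjuvant chemotherapy and outperformed current intrinsic molecular breast cancer classifications.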


Bibliographic Details
Main Authors: Jelmar Quist, Lawson Taylor, Johan Staaf, Anita Grigoriadis
Format: Article
Language: English
Published: MDPI AG 2021-02-01
Series: Cancers
Subjects: breast cancer; random forest; machine learning; integrative analysis; DNA damage repair
Online Access: https://www.mdpi.com/2072-6694/13/5/991
doi 10.3390/cancers13050991
volume 13
pages 991
affiliation Jelmar Quist: Cancer Bioinformatics, Cancer Centre at Guy’s Hospital, King’s College London, London SE1 9RT, UK
Lawson Taylor: Cancer Bioinformatics, Cancer Centre at Guy’s Hospital, King’s College London, London SE1 9RT, UK
Johan Staaf: Division of Oncology, Department of Clinical Sciences Lund, Lund University, Medicon Village, SE-223 81 Lund, Sweden
Anita Grigoriadis: Cancer Bioinformatics, Cancer Centre at Guy’s Hospital, King’s College London, London SE1 9RT, UK
collection DOAJ
sources DOAJ
author Jelmar Quist
Lawson Taylor
Johan Staaf
Anita Grigoriadis
title Random Forest Modelling of High-Dimensional Mixed-Type Data for Breast Cancer Classification
issn 2072-6694
description Advances in high-throughput technologies encourage the generation of large amounts of multiomics data to investigate complex diseases, including breast cancer. Given that the aetiologies of such diseases extend beyond a single biological entity, and that essential biological information can be carried by all data regardless of data type, integrative analyses are needed to identify clinically relevant patterns. To facilitate such analyses, we present a permutation-based framework for random forest methods which simultaneously allows the unbiased integration of mixed-type data and assessment of relative feature importance. Through simulation studies and machine learning datasets, the performance of the approach was evaluated. The results showed minimal multicollinearity and limited overfitting. To further assess the performance, the permutation-based framework was applied to high-dimensional mixed-type data from two independent breast cancer cohorts. Reproducibility and robustness of our approach were demonstrated by the concordance in relative feature importance between the cohorts, along with consistencies in clustering profiles. One of the identified clusters was shown to be prognostic for clinical outcome after standard-of-care adjuvant chemotherapy and outperformed current intrinsic molecular breast cancer classifications.
topic breast cancer
random forest
machine learning
integrative analysis
DNA damage repair