Random Forest Modelling of High-Dimensional Mixed-Type Data for Breast Cancer Classification
Advances in high-throughput technologies encourage the generation of large amounts of multiomics data to investigate complex diseases, including breast cancer. Given that the aetiologies of such diseases extend beyond a single biological entity, and that essential biological information can be carri...
Main Authors: | Jelmar Quist, Lawson Taylor, Johan Staaf, Anita Grigoriadis |
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG, 2021-02-01 |
Series: | Cancers |
Subjects: | breast cancer; random forest; machine learning; integrative analysis; DNA damage repair |
Online Access: | https://www.mdpi.com/2072-6694/13/5/991 |
id |
doaj-dddc549c079e45fabb945b11e299db29 |
---|---|
record_format |
Article |
spelling |
doaj-dddc549c079e45fabb945b11e299db29; 2021-02-28T00:02:06Z; eng; MDPI AG; Cancers; 2072-6694; 2021-02-01; 13; 991; 991; 10.3390/cancers13050991; Random Forest Modelling of High-Dimensional Mixed-Type Data for Breast Cancer Classification; Jelmar Quist (0), Lawson Taylor (1), Johan Staaf (2), Anita Grigoriadis (3); (0) Cancer Bioinformatics, Cancer Centre at Guy’s Hospital, King’s College London, London SE1 9RT, UK; (1) Cancer Bioinformatics, Cancer Centre at Guy’s Hospital, King’s College London, London SE1 9RT, UK; (2) Division of Oncology, Department of Clinical Sciences Lund, Lund University, Medicon Village, SE-223 81 Lund, Sweden; (3) Cancer Bioinformatics, Cancer Centre at Guy’s Hospital, King’s College London, London SE1 9RT, UK; Advances in high-throughput technologies encourage the generation of large amounts of multiomics data to investigate complex diseases, including breast cancer. Given that the aetiologies of such diseases extend beyond a single biological entity, and that essential biological information can be carried by all data regardless of data type, integrative analyses are needed to identify clinically relevant patterns. To facilitate such analyses, we present a permutation-based framework for random forest methods which simultaneously allows the unbiased integration of mixed-type data and assessment of relative feature importance. Through simulation studies and machine learning datasets, the performance of the approach was evaluated. The results showed minimal multicollinearity and limited overfitting. To further assess the performance, the permutation-based framework was applied to high-dimensional mixed-type data from two independent breast cancer cohorts. The reproducibility and robustness of our approach were demonstrated by the concordance in relative feature importance between the cohorts, along with consistencies in clustering profiles. One of the identified clusters was shown to be prognostic for clinical outcome after standard-of-care adjuvant chemotherapy and outperformed current intrinsic molecular breast cancer classifications.; https://www.mdpi.com/2072-6694/13/5/991; breast cancer; random forest; machine learning; integrative analysis; DNA damage repair |
collection |
DOAJ |
language |
English |
format |
Article |
sources |
DOAJ |
author |
Jelmar Quist Lawson Taylor Johan Staaf Anita Grigoriadis |
spellingShingle |
Jelmar Quist Lawson Taylor Johan Staaf Anita Grigoriadis Random Forest Modelling of High-Dimensional Mixed-Type Data for Breast Cancer Classification Cancers breast cancer random forest machine learning integrative analysis DNA damage repair |
author_facet |
Jelmar Quist Lawson Taylor Johan Staaf Anita Grigoriadis |
author_sort |
Jelmar Quist |
title |
Random Forest Modelling of High-Dimensional Mixed-Type Data for Breast Cancer Classification |
title_short |
Random Forest Modelling of High-Dimensional Mixed-Type Data for Breast Cancer Classification |
title_full |
Random Forest Modelling of High-Dimensional Mixed-Type Data for Breast Cancer Classification |
title_fullStr |
Random Forest Modelling of High-Dimensional Mixed-Type Data for Breast Cancer Classification |
title_full_unstemmed |
Random Forest Modelling of High-Dimensional Mixed-Type Data for Breast Cancer Classification |
title_sort |
random forest modelling of high-dimensional mixed-type data for breast cancer classification |
publisher |
MDPI AG |
series |
Cancers |
issn |
2072-6694 |
publishDate |
2021-02-01 |
description |
Advances in high-throughput technologies encourage the generation of large amounts of multiomics data to investigate complex diseases, including breast cancer. Given that the aetiologies of such diseases extend beyond a single biological entity, and that essential biological information can be carried by all data regardless of data type, integrative analyses are needed to identify clinically relevant patterns. To facilitate such analyses, we present a permutation-based framework for random forest methods which simultaneously allows the unbiased integration of mixed-type data and assessment of relative feature importance. Through simulation studies and machine learning datasets, the performance of the approach was evaluated. The results showed minimal multicollinearity and limited overfitting. To further assess the performance, the permutation-based framework was applied to high-dimensional mixed-type data from two independent breast cancer cohorts. The reproducibility and robustness of our approach were demonstrated by the concordance in relative feature importance between the cohorts, along with consistencies in clustering profiles. One of the identified clusters was shown to be prognostic for clinical outcome after standard-of-care adjuvant chemotherapy and outperformed current intrinsic molecular breast cancer classifications. |
topic |
breast cancer; random forest; machine learning; integrative analysis; DNA damage repair |
url |
https://www.mdpi.com/2072-6694/13/5/991 |
work_keys_str_mv |
AT jelmarquist randomforestmodellingofhighdimensionalmixedtypedataforbreastcancerclassification AT lawsontaylor randomforestmodellingofhighdimensionalmixedtypedataforbreastcancerclassification AT johanstaaf randomforestmodellingofhighdimensionalmixedtypedataforbreastcancerclassification AT anitagrigoriadis randomforestmodellingofhighdimensionalmixedtypedataforbreastcancerclassification |
_version_ |
1724247872383221760 |