Cluster analysis on high dimensional RNA-seq data with applications to cancer research - An evaluation study.

BACKGROUND: Clustering of gene expression data is widely used to identify novel subtypes of cancer. Plenty of clustering approaches have been proposed, but there is a lack of knowledge regarding their relative merits and how data characteristics influence the performance. We evaluate how cluster analysis choices affect the performance by studying four publicly available human cancer data sets: breast, brain, kidney and stomach cancer.

Bibliographic Details
Main Authors: Linda Vidman, David Källberg, Patrik Rydén
Format: Article
Language: English
Published: Public Library of Science (PLoS) 2019-01-01
Series: PLoS ONE
Online Access: https://doi.org/10.1371/journal.pone.0219102
id doaj-59b82c3723a64ee08c58dc6c59a1fc5a
record_format Article
spelling doaj-59b82c3723a64ee08c58dc6c59a1fc5a 2021-03-03T21:16:23Z eng Public Library of Science (PLoS) PLoS ONE 1932-6203 2019-01-01 14 12 e0219102 10.1371/journal.pone.0219102 Cluster analysis on high dimensional RNA-seq data with applications to cancer research - An evaluation study. Linda Vidman David Källberg Patrik Rydén https://doi.org/10.1371/journal.pone.0219102
collection DOAJ
language English
format Article
sources DOAJ
author Linda Vidman
David Källberg
Patrik Rydén
title Cluster analysis on high dimensional RNA-seq data with applications to cancer research - An evaluation study.
publisher Public Library of Science (PLoS)
series PLoS ONE
issn 1932-6203
publishDate 2019-01-01
description BACKGROUND: Clustering of gene expression data is widely used to identify novel subtypes of cancer. Plenty of clustering approaches have been proposed, but there is a lack of knowledge regarding their relative merits and how data characteristics influence the performance. We evaluate how cluster analysis choices affect the performance by studying four publicly available human cancer data sets: breast, brain, kidney and stomach cancer. In particular, we focus on how the sample size, distribution of subtypes and sample heterogeneity affect the performance. RESULTS: In general, increasing the sample size had a limited effect on the clustering performance, e.g. for the breast cancer data similar performance was obtained for n = 40 as for n = 330. The relative distribution of the subtypes had a noticeable effect on the ability to identify the disease subtypes, and data with disproportionate cluster sizes turned out to be difficult to cluster. Both the choice of clustering method and selection method affected the ability to identify the subtypes, but the relative performance varied between data sets, making it difficult to rank the approaches. For some data sets, the performance was substantially higher when the clustering was based on data from only one sex compared to data from a mixed population. This suggests that homogeneous data are easier to cluster than heterogeneous data and that clustering males and females individually may be beneficial and increase the chance of detecting novel subtypes. It was also observed that the performance often differed substantially between females and males. CONCLUSIONS: The number of samples seems to have a limited effect on the performance, while the heterogeneity, at least with respect to sex, is important for the performance. Hence, by analyzing the genders separately, the possible loss caused by having fewer samples could be outweighed by the benefit of more homogeneous data.
url https://doi.org/10.1371/journal.pone.0219102