Clustering Sparse Data With Feature Correlation With Application to Discover Subtypes in Cancer

In this paper, given data with high-dimensional features, we study how to calculate the similarity between two samples by considering a feature interaction network, which represents the relationships between features. This is different from some traditional...

Bibliographic Details
Main Authors: Jipeng Qiang, Wei Ding, Marieke Kuijjer, John Quackenbush, Ping Chen
Format: Article
Language: English
Published: IEEE 2020-01-01
Series: IEEE Access
Subjects: Cancer subtype; feature interaction network; similarity metric; somatic mutational data
Online Access: https://ieeexplore.ieee.org/document/9048133/
id doaj-2bf2ba0808874dbba69c614a6b728dd8
record_format Article
spelling doaj-2bf2ba0808874dbba69c614a6b728dd8 2021-03-30T03:18:45Z eng
IEEE, IEEE Access, ISSN 2169-3536, 2020-01-01, vol. 8, pp. 67775-67789, DOI 10.1109/ACCESS.2020.2982569, document 9048133
Clustering Sparse Data With Feature Correlation With Application to Discover Subtypes in Cancer
Jipeng Qiang (https://orcid.org/0000-0001-8036-9550), Department of Computer Science, Yangzhou University, Yangzhou, China
Wei Ding (https://orcid.org/0000-0002-3383-551X), Department of Computer Science, University of Massachusetts Boston, Boston, MA, USA
Marieke Kuijjer (https://orcid.org/0000-0001-6280-3130), Centre for Molecular Medicine Norway, University of Oslo Faculty of Medicine, Oslo, Norway
John Quackenbush (https://orcid.org/0000-0002-2702-5879), Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, MA, USA
Ping Chen (https://orcid.org/0000-0003-3789-7686), Department of Computer Science, University of Massachusetts Boston, Boston, MA, USA
https://ieeexplore.ieee.org/document/9048133/
Cancer subtype; feature interaction network; similarity metric; somatic mutational data
collection DOAJ
language English
format Article
sources DOAJ
author Jipeng Qiang
Wei Ding
Marieke Kuijjer
John Quackenbush
Ping Chen
title Clustering Sparse Data With Feature Correlation With Application to Discover Subtypes in Cancer
publisher IEEE
series IEEE Access
issn 2169-3536
publishDate 2020-01-01
description In this paper, given data with high-dimensional features, we study how to calculate the similarity between two samples by considering a feature interaction network, which represents the relationships between features. This differs from some traditional methods, which learn similarities from a sample network that represents the relationships between samples. To overcome the data sparseness problem, we propose a novel network-based similarity metric for computing the similarity between samples that incorporates knowledge of the feature interaction network. Our similarity metric uses a new Feature Alignment Similarity measure, which does not compute the similarities among samples directly. Instead, it projects each sample into the feature interaction network and measures the similarity between two samples using the similarities between the vertices of the samples in the network. As a result, even when two samples do not share any common features, they are likely to have higher similarity values when their features lie in similar network regions. To ensure that the metric is useful in a real-world application, we apply it to discover subtypes in tumor mutational data by incorporating the information of the gene interaction network. Our experimental results on synthetic data and real-world tumor mutational data show that our approach outperforms the top competitors in cancer subtype discovery. Furthermore, our approach can identify cancer subtypes in real cancer data that other clustering algorithms cannot detect.
topic Cancer subtype
feature interaction network
similarity metric
somatic mutational data
url https://ieeexplore.ieee.org/document/9048133/
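
The idea in the description, that two samples with no features in common can still be similar when their features fall in the same region of a feature interaction network, can be illustrated with a small sketch. This record does not give the paper's Feature Alignment Similarity formula, so everything below is an assumption made for illustration: the toy network, the choice of cosine similarity between closed neighborhoods as the vertex similarity, the symmetric best-match averaging, and the function names `feature_similarity` and `alignment_similarity`.

```python
import numpy as np

def feature_similarity(adj):
    """Vertex similarity (assumed choice): cosine similarity between the
    closed neighborhoods (adjacency plus self-loop) of every feature pair."""
    closed = adj + np.eye(adj.shape[0])
    unit = closed / np.linalg.norm(closed, axis=1, keepdims=True)
    return unit @ unit.T

def alignment_similarity(x, y, feat_sim):
    """Hypothetical alignment-style similarity between two binary samples:
    match each active feature of one sample to its most similar active
    feature in the other, then average the best matches in both directions."""
    ix, iy = np.flatnonzero(x), np.flatnonzero(y)
    if ix.size == 0 or iy.size == 0:
        return 0.0  # an empty sample has nothing to align
    block = feat_sim[np.ix_(ix, iy)]
    return 0.5 * (block.max(axis=1).mean() + block.max(axis=0).mean())

# Toy network: features 0-2 form one fully connected group, 3-5 another.
adj = np.zeros((6, 6))
for group in ([0, 1, 2], [3, 4, 5]):
    for i in group:
        for j in group:
            if i != j:
                adj[i, j] = 1.0

sim = feature_similarity(adj)

# Three sparse samples, each with a single active feature.
a = np.array([1, 0, 0, 0, 0, 0])  # feature 0
b = np.array([0, 1, 0, 0, 0, 0])  # feature 1, same group as feature 0
c = np.array([0, 0, 0, 0, 1, 0])  # feature 4, other group

print("shared features a,b:", int(a @ b))
print("network similarity a,b:", alignment_similarity(a, b, sim))
print("network similarity a,c:", alignment_similarity(a, c, sim))
```

Samples `a` and `b` share no feature, so a plain overlap count treats them as unrelated, while the network-based score separates the pair lying in the same group from the pair `a`, `c` lying in different groups. A matrix of such pairwise scores could then be passed to any clustering method that accepts a precomputed similarity.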