Pathogenic Gene Prediction Algorithm Based on Heterogeneous Information Fusion

Complex diseases seriously affect people's physical and mental health. The discovery of disease-causing genes has become a target of research. With the emergence of bioinformatics and the rapid development of biotechnology, to overcome the inherent difficulties of the long experimental period a...

Bibliographic Details
Main Authors:	Chunyu Wang, Jie Zhang, Xueping Wang, Ke Han, Maozu Guo
Format:	Article
Language:	English
Published:	Frontiers Media S.A. 2020-02-01
Series:	Frontiers in Genetics
Subjects:	pathogenic gene prediction; induction matrix completion; compact feature learning; PU-Learning; mean percentile ranking
Online Access:	https://www.frontiersin.org/article/10.3389/fgene.2020.00005/full

id	doaj-214576500efe4a6c83ae5fe0527bb42a
record_format	Article
spelling	doaj-214576500efe4a6c83ae5fe0527bb42a | 2020-11-24T22:09:23Z | eng | Frontiers Media S.A. | Frontiers in Genetics | 1664-8021 | 2020-02-01 | 11 | 10.3389/fgene.2020.00005 | 514814 | Pathogenic Gene Prediction Algorithm Based on Heterogeneous Information Fusion | Chunyu Wang [0], Jie Zhang [1], Xueping Wang [2], Ke Han [3], Maozu Guo [4], Maozu Guo [5] | [0] School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China | [1] School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China | [2] School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China | [3] School of Computer and Information Engineering, Harbin University of Commerce, Harbin, China | [4] School of Electrical and Information Engineering, Beijing University of Civil Engineering and Architecture, Beijing, China | [5] Beijing Key Laboratory of Intelligent Processing for Building Big Data, Beijing University of Civil Engineering and Architecture, Beijing, China
collection	DOAJ
language	English
format	Article
sources	DOAJ
author	Chunyu Wang, Jie Zhang, Xueping Wang, Ke Han, Maozu Guo
author_facet	Chunyu Wang, Jie Zhang, Xueping Wang, Ke Han, Maozu Guo
author_sort	Chunyu Wang
title	Pathogenic Gene Prediction Algorithm Based on Heterogeneous Information Fusion
publisher	Frontiers Media S.A.
series	Frontiers in Genetics
issn	1664-8021
publishDate	2020-02-01
description	Complex diseases seriously affect people's physical and mental health, and the discovery of disease-causing genes has become a target of research. With the emergence of bioinformatics and the rapid development of biotechnology, researchers have proposed many gene prioritization algorithms that mine pathogenic genes from large amounts of biological data, in order to overcome the long experimental periods and high costs of traditional biomedical methods. However, because the currently known gene–disease association matrix is still very sparse and lacks negative evidence (confirmation that a given gene and disease are unrelated), the predictive performance of gene prioritization algorithms is limited. Based on the hypothesis that functionally related gene mutations may lead to similar disease phenotypes, this paper proposes a PU induction matrix completion algorithm based on heterogeneous information fusion (PUIMCHIF) to predict candidate genes involved in the pathogenicity of human diseases. On the one hand, PUIMCHIF uses different compact feature learning methods to extract features of genes and diseases from multiple data sources, compensating for the sparsity of the association data. On the other hand, based on the prior knowledge that most unknown gene–disease pairs are unrelated, we use the PU-Learning strategy to treat the unlabeled data as negative examples for biased learning. On the three metrics of precision, recall, and mean percentile ranking (MPR), PUIMCHIF performed significantly better than the other algorithms. In the top 100 global prediction analysis of multiple genes and multiple diseases, the probability of recovering true gene associations using PUIMCHIF reached 50% and the MPR value was 10.94%. The PUIMCHIF algorithm performs better than other methods, such as IMC and CATAPULT.
topic	pathogenic gene prediction; induction matrix completion; compact feature learning; PU-Learning; mean percentile ranking
url	https://www.frontiersin.org/article/10.3389/fgene.2020.00005/full
_version_	1725812194050834432
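
Illustrative sketch (not part of the catalog record, and not the authors' implementation): the description above mentions three ideas that a few lines of code can make concrete. These are scoring gene–disease pairs from side features, treating unlabeled pairs as weak negatives ("biased learning"), and evaluating with mean percentile ranking. The bilinear score X W H^T Y^T, the weight `alpha`, the squared-error loss, and the percentile definition (0% = ranked first, lower is better) are assumptions chosen for illustration; the article itself should be consulted for the exact formulation. All function names are hypothetical.

```python
import numpy as np

def imc_scores(X, Y, W, H):
    """Score every (gene, disease) pair as X W H^T Y^T.

    X: gene features (n_genes x f_g), Y: disease features (n_diseases x f_d),
    W: (f_g x k) and H: (f_d x k) are low-rank factors. Assumed form.
    """
    return X @ W @ H.T @ Y.T

def biased_pu_loss(P, S, alpha=0.2):
    """Weighted squared error between labels P (1 = known, 0 = unlabeled) and scores S.

    Unlabeled entries are treated as negatives but down-weighted by alpha < 1,
    so a wrong guess on an unlabeled pair costs less than missing a known one.
    """
    weights = np.where(P == 1, 1.0, alpha)
    return float(np.sum(weights * (P - S) ** 2))

def mean_percentile_rank(S, held_out):
    """Average percentile rank of held-out (gene, disease) pairs.

    For each pair, count the genes scored strictly higher for that disease
    and scale to 0..100 (0 = ranked first). Lower values are better.
    """
    n_genes = S.shape[0]
    percentiles = []
    for gene, disease in held_out:
        column = S[:, disease]
        better = int(np.sum(column > column[gene]))
        percentiles.append(100.0 * better / (n_genes - 1))
    return float(np.mean(percentiles))
```

Down-weighting rather than discarding the unlabeled entries reflects the prior stated in the description that most unknown gene–disease pairs are unrelated, while still allowing some of them to turn out positive.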

Similar Items