Virtual Screening of Drug Proteins Based on Imbalance Data Mining

To address the imbalanced data problem in molecular docking-based virtual screening methods, this paper proposes a virtual screening method for drug proteins based on imbalanced data mining, which introduces machine learning technology into the virtual screening technology for drug proteins to deal...

Bibliographic Details
Main Authors:	Peng Li, Lili Yin, Bo Zhao, Yuezhongyi Sun
Format:	Article
Language:	English
Published:	Hindawi Limited 2021-01-01
Series:	Mathematical Problems in Engineering
Online Access:	http://dx.doi.org/10.1155/2021/5585990

id	doaj-6342a990534b4c35951817d7f850f59a
record_format	Article
spelling	doaj-6342a990534b4c35951817d7f850f59a | 2021-06-07T02:12:59Z | eng | Hindawi Limited | Mathematical Problems in Engineering | 1563-5147 | 2021-01-01 | 2021 | 10.1155/2021/5585990 | Virtual Screening of Drug Proteins Based on Imbalance Data Mining | Peng Li (0), Lili Yin (1), Bo Zhao (2), Yuezhongyi Sun (3) | School of Computer Science and Technology (listed for each of the four authors) | http://dx.doi.org/10.1155/2021/5585990
collection	DOAJ
language	English
format	Article
sources	DOAJ
author	Peng Li Lili Yin Bo Zhao Yuezhongyi Sun
title	Virtual Screening of Drug Proteins Based on Imbalance Data Mining
publisher	Hindawi Limited
series	Mathematical Problems in Engineering
issn	1563-5147
publishDate	2021-01-01
description	To address the imbalanced data problem in molecular docking-based virtual screening, this paper proposes a virtual screening method for drug proteins based on imbalanced data mining. The method introduces machine learning into virtual screening for drug proteins in order to handle the data imbalance that arises during screening and to improve screening accuracy. First, the docking conformations generated in an actual virtual screening run contain far fewer active compounds than inactive ones; to reduce this imbalance, the paper combines SMOTE with a genetic algorithm to upsample the active class by synthesizing new active compounds. Second, to improve screening accuracy, the paper adopts ensemble learning: random forest (RF), an extension of the Bagging ensemble technique, is combined with a support vector machine (SVM), and the resulting RF-SVM technique is used to screen molecular docking conformations and improve the prediction accuracy for active compounds. To verify the proposed technique, HIV-1 protease and SRC kinase were first used as test data for the experiments, and CA II was then used to validate the model of the test data. On the test dataset, the proposed method improved both the enrichment factor (EF) and the AUC compared with traditional virtual screening, indicating that it can effectively improve the accuracy of virtual drug screening.
url	http://dx.doi.org/10.1155/2021/5585990
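
Illustrative note: the description above names two ideas that are easy to show in a few lines, SMOTE-style upsampling of the active class and the enrichment factor (EF) used for evaluation. The sketch below is not the authors' implementation. It shows plain SMOTE interpolation only (the genetic algorithm and the RF-SVM classifier are not reproduced), and it assumes that a higher score means a better-ranked compound. The function names, the parameters `k`, `seed` and `fraction`, and the toy data are all hypothetical.

```python
import random


def smote_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic minority points (plain SMOTE, no genetic algorithm).

    Each synthetic point lies on the segment between a randomly chosen
    minority sample and one of its k nearest minority neighbours.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


def enrichment_factor(scores, labels, fraction):
    """EF = (active rate in the top-ranked fraction) / (active rate overall).

    Assumption: higher score = better rank; labels are 1 (active) or 0 (inactive).
    """
    n_total = len(scores)
    n_actives = sum(labels)
    if n_actives == 0:
        raise ValueError("EF is undefined without any active compounds")
    n_top = max(1, int(round(n_total * fraction)))
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    hits_top = sum(label for _, label in ranked[:n_top])
    return (hits_top / n_top) / (n_actives / n_total)


if __name__ == "__main__":
    # Toy descriptors for four "active" compounds (hypothetical data).
    actives = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print(smote_oversample(actives, n_new=3))

    # Ten ranked compounds, two of them active and both ranked at the top.
    scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
    labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
    print(enrichment_factor(scores, labels, fraction=0.2))
```

An EF above 1 at a given fraction means the top of the ranked list holds proportionally more actives than the library as a whole. If docking scores are used directly, where lower is usually better, the scores would need to be negated before ranking.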

Similar Items