id ndltd-OhioLink-oai-etd.ohiolink.edu-akron1524661027035591
record_format oai_dc
spelling ndltd-OhioLink-oai-etd.ohiolink.edu-akron1524661027035591 2021-08-03T07:06:26Z Data Mining/Machine Learning Techniques for Drug Discovery: Computational and Experimental Pipeline Development Chen, Jonathan Jun Feng Bioinformatics Biology Biochemistry Chemical Engineering Computer Science vHTS virtual high-throughput screening bioinformatics biological informatics machine-learning QSAR quantitative structure-activity relationship cheminformatics chemical chemoinformatics data-mining pipeline PubChem drug discovery Medicine is a precious commodity that saves, prolongs, or increases the quality of life. However, medicinal active ingredient discovery is challenging and is one of the major bottlenecks to developing new pharmaceuticals. Progressive development of new therapeutic targets and compounds exacerbates the problem as the scale of the drug discovery endeavor increases to an unmanageable size. For example, the National Institutes of Health houses the National Library of Medicine (NLM), which contains an ever-growing archive of genes, proteins, and therapeutic targets as well as candidate compounds. Manual inspection of all compounds and biological targets cannot match the rate at which new information is created and deposited. New methods of data processing and drug candidate consideration are needed. The work presented used and processed data from the NLM to identify new candidates for consideration. The drug discovery pipeline central to this work created models from existing compound-target interaction data that correlated structure to activity. The models were used to identify the next candidates to test. Compound structural information was captured using the Signature molecular descriptor, while models were created using principal component analysis, a genetic algorithm, and support vector machines.
The models identified new candidates for activity validation experiments in a virtual high-throughput screen of the 72 million compounds in the PubChem Compound database of the National Library of Medicine. The models were retrained to determine whether improvement was possible and what might affect the improvement resulting from retraining. After activity validation experiments, the activity and structure of candidates and compounds from the training set were compared to identify structure-activity relationships for additional avenues of inquiry. Seven different case studies were conducted to test the robustness of the pipeline in response to changing dataset size and active fraction: Cathepsin L, Factor XIIa, Factor XIa, C1s, SENP8, and PK-M2 with two different datasets. Results from all seven case studies showed that model retraining was beneficial and that the pipeline was more effective at low active fractions. Recommendations for future use include retraining models when possible, extrapolating incrementally, and applying the pipeline to datasets with small active fractions while avoiding large datasets with high active fractions, to maximize pipeline effectiveness and utility. 2018-05-23 English text University of Akron / OhioLINK http://rave.ohiolink.edu/etdc/view?acc_num=akron1524661027035591 unrestricted This thesis or dissertation is protected by copyright: all rights reserved. It may not be copied or redistributed beyond the terms of applicable copyright laws.
collection NDLTD
language English
sources NDLTD
topic Bioinformatics
Biology
Biochemistry
Chemical Engineering
Computer Science
vHTS
virtual
high-throughput
screening
bioinformatics
biological
informatics
machine-learning
QSAR
quantitative structure-activity relationship
cheminformatics
chemical
chemoinformatics
data-mining
pipeline
PubChem
drug
discovery
spellingShingle Bioinformatics
Biology
Biochemistry
Chemical Engineering
Computer Science
vHTS
virtual
high-throughput
screening
bioinformatics
biological
informatics
machine-learning
QSAR
quantitative structure-activity relationship
cheminformatics
chemical
chemoinformatics
data-mining
pipeline
PubChem
drug
discovery
Chen, Jonathan Jun Feng
Data Mining/Machine Learning Techniques for Drug Discovery: Computational and Experimental Pipeline Development
author Chen, Jonathan Jun Feng
author_facet Chen, Jonathan Jun Feng
author_sort Chen, Jonathan Jun Feng
title Data Mining/Machine Learning Techniques for Drug Discovery: Computational and Experimental Pipeline Development
title_short Data Mining/Machine Learning Techniques for Drug Discovery: Computational and Experimental Pipeline Development
title_full Data Mining/Machine Learning Techniques for Drug Discovery: Computational and Experimental Pipeline Development
title_fullStr Data Mining/Machine Learning Techniques for Drug Discovery: Computational and Experimental Pipeline Development
title_full_unstemmed Data Mining/Machine Learning Techniques for Drug Discovery: Computational and Experimental Pipeline Development
title_sort data mining/machine learning techniques for drug discovery: computational and experimental pipeline development
publisher University of Akron / OhioLINK
publishDate 2018
url http://rave.ohiolink.edu/etdc/view?acc_num=akron1524661027035591
work_keys_str_mv AT chenjonathanjunfeng dataminingmachinelearningtechniquesfordrugdiscoverycomputationalandexperimentalpipelinedevelopment
_version_ 1719453524417314816