Genetic Algorithms for Optimization of Machine-learning Models and their Applications in Bioinformatics

Machine-learning (ML) techniques have been widely applied to problems in biology. However, biological data are large and complex, which often results in extremely intricate ML models. These models frequently perform poorly or are computationally infeasible. This study...


Bibliographic Details
Main Author: Magana-Mora, Arturo
Other Authors: Bajic, Vladimir B.
Language: en
Published: 2017
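Subjects: omnivariate decision trees; Machine Learning; polyadenylation signals; Bioinformatics; translation initiation sites; Data Mining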
Online Access: http://hdl.handle.net/10754/623317
id ndltd-kaust.edu.sa-oai-repository.kaust.edu.sa-10754-623317
record_format oai_dc
spelling ndltd-kaust.edu.sa-oai-repository.kaust.edu.sa-10754-623317
2020-08-24T05:08:18Z
Genetic Algorithms for Optimization of Machine-learning Models and their Applications in Bioinformatics
Magana-Mora, Arturo
Bajic, Vladimir B.
Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division
Gojobori, Takashi
Moshkov, Mikhail
Wong, Limsoon
omnivariate decision trees
Machine Learning
polyadenylation signals
Bioinformatics
translation initiation sites
Data Mining
2017-05-04T06:23:53Z
2018-05-04T00:00:00Z
2017-04-29
Dissertation
10.25781/KAUST-DPDH3
http://hdl.handle.net/10754/623317
en
2018-05-04
At the time of archiving, the student author of this dissertation opted to temporarily restrict access to it. The full text of this dissertation became available to the public after the expiration of the embargo on 2018-05-04.
collection NDLTD
language en
sources NDLTD
topic omnivariate decision trees
Machine Learning
polyadenylation signals
Bioinformatics
translation initiation sites
Data Mining
description Machine-learning (ML) techniques have been widely applied to problems in biology. However, biological data are large and complex, which often results in extremely intricate ML models. These models frequently perform poorly or are computationally infeasible. This study presents a set of novel computational methods and focuses on the application of genetic algorithms (GAs) to the simplification and optimization of ML models and on their applications to biological problems. The dissertation addresses three challenges. The first is to develop a generalizable classification methodology able to systematically derive competitive models despite the complexity and nature of the data. Although several algorithms for the induction of classification models have been proposed, these algorithms are data dependent. Consequently, we developed OmniGA, a novel and generalizable framework that uses different classification models in a tree-like decision structure, along with a parallel GA for the optimization of the OmniGA structure. Results show that OmniGA consistently outperformed existing commonly used classification models. The second challenge is the prediction of translation initiation sites (TIS) in plant genomic DNA. We performed a statistical analysis of the genomic DNA and proposed a new set of discriminant features for this problem. We developed a wrapper method based on GAs for selecting an optimal feature subset, which, in conjunction with a classification model, produced the most accurate framework for the recognition of TIS in plants. The results also demonstrate that, despite the evolutionary distance between different plants, our approach successfully identified conserved genomic elements that may serve as the starting point for the development of a generic model for prediction of TIS in eukaryotic organisms. The third challenge is the accurate prediction of polyadenylation signals in human genomic DNA. To achieve this, we analyzed genomic DNA sequences for the 12 most frequent polyadenylation signal variants and proposed a new set of features that may contribute to the understanding of the polyadenylation process. We derived Omni-PolyA, a model and tool based on OmniGA for the prediction of polyadenylation signals. Results show that Omni-PolyA significantly reduced the average classification error rate compared to state-of-the-art results.
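The description mentions a GA-based wrapper for feature selection, in which candidate feature subsets are scored by the classifier they produce. A minimal sketch of that general idea follows. It is not the dissertation's implementation: the synthetic data, the nearest-centroid classifier, the size penalty, and every function name and parameter value here (make_data, centroid_accuracy, ga_select, run_demo, pop_size, p_mut, penalty) are assumptions chosen only for illustration.

```python
import random


def make_data(rng, n_samples=120, n_features=8, informative=(0, 1)):
    """Synthetic two-class data; only the `informative` columns depend on the label."""
    X, y = [], []
    for _ in range(n_samples):
        label = rng.randint(0, 1)
        row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
        for j in informative:
            row[j] += 3.0 * label  # shift informative features for class 1
        X.append(row)
        y.append(label)
    return X, y


def centroid_accuracy(mask, train, test):
    """Accuracy of a nearest-centroid classifier using only the features set in mask."""
    cols = [j for j, bit in enumerate(mask) if bit]
    if not cols:
        return 0.0  # an empty feature subset cannot classify anything
    centroids = {}
    for label in (0, 1):
        rows = [x for x, t in train if t == label]
        centroids[label] = [sum(r[j] for r in rows) / len(rows) for j in cols]
    correct = 0
    for x, t in test:
        dist = {c: sum((x[j] - m) ** 2 for j, m in zip(cols, centroids[c]))
                for c in (0, 1)}
        correct += min(dist, key=dist.get) == t
    return correct / len(test)


def ga_select(fitness, n_features, rng, pop_size=20, generations=25, p_mut=0.1):
    """Evolve bit masks (1 = feature kept) toward higher fitness."""
    pop = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(pop_size)]
    best = max(pop, key=fitness)

    def tournament():
        # Binary tournament: the fitter of two random individuals becomes a parent.
        a, b = rng.sample(pop, 2)
        return a if fitness(a) >= fitness(b) else b

    for _ in range(generations):
        children = [best[:]]  # elitism: carry the best mask over unchanged
        while len(children) < pop_size:
            p1, p2 = tournament(), tournament()
            cut = rng.randrange(1, n_features)  # one-point crossover
            child = p1[:cut] + p2[cut:]
            # Bit-flip mutation with probability p_mut per position.
            child = [bit ^ 1 if rng.random() < p_mut else bit for bit in child]
            children.append(child)
        pop = children
        best = max(pop, key=fitness)
    return best


def run_demo(seed=0, penalty=0.01):
    """Run the wrapper on synthetic data; returns (best mask, its fitness)."""
    rng = random.Random(seed)
    X, y = make_data(rng)
    data = list(zip(X, y))
    train, test = data[:80], data[80:]

    def fitness(mask):
        # Wrapper criterion: held-out accuracy minus a small cost per kept feature.
        return centroid_accuracy(mask, train, test) - penalty * sum(mask)

    best = ga_select(fitness, n_features=len(X[0]), rng=rng)
    return best, fitness(best)


if __name__ == "__main__":
    mask, score = run_demo()
    print("selected feature mask:", mask, "fitness:", round(score, 3))
```

In this sketch the same held-out split both guides the search and scores the result, so the reported fitness is optimistic; a real study would evaluate the selected subset on data not seen during selection.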
author2 Bajic, Vladimir B.
author_facet Bajic, Vladimir B.
Magana-Mora, Arturo
author Magana-Mora, Arturo
author_sort Magana-Mora, Arturo
title Genetic Algorithms for Optimization of Machine-learning Models and their Applications in Bioinformatics
publishDate 2017
url http://hdl.handle.net/10754/623317