Genetic Algorithms for Optimization of Machine-learning Models and their Applications in Bioinformatics

Machine-learning (ML) techniques have been widely applied to solve different problems in biology. However, biological data are large and complex, which often results in extremely intricate ML models. Frequently, these models perform poorly or are computationally infeasible. This study...


Bibliographic Details
Main Author:	Magana-Mora, Arturo
Other Authors:	Bajic, Vladimir B.
Language:	en
Published:	2017
Subjects:	omnivariate decision trees; Machine Learning; polyadenylation signals; Bioinformatics; translation initiation sites; Data Mining
Online Access:	http://hdl.handle.net/10754/623317

id	ndltd-kaust.edu.sa-oai-repository.kaust.edu.sa-10754-623317
record_format	oai_dc
spelling	ndltd-kaust.edu.sa-oai-repository.kaust.edu.sa-10754-623317 2020-08-24T05:08:18Z Genetic Algorithms for Optimization of Machine-learning Models and their Applications in Bioinformatics Magana-Mora, Arturo Bajic, Vladimir B. Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division Gojobori, Takashi Moshkov, Mikhail Wong, Limsoon omnivariate decision trees Machine Learning polyadenylation signals Bioinformatics translation initiation sites Data Mining Machine-learning (ML) techniques have been widely applied to solve different problems in biology. However, biological data are large and complex, which often results in extremely intricate ML models. Frequently, these models perform poorly or are computationally infeasible. This study presents a set of novel computational methods and focuses on the application of genetic algorithms (GAs) for the simplification and optimization of ML models and their applications to biological problems. The dissertation addresses the following three challenges. The first is to develop a generalizable classification methodology able to systematically derive competitive models despite the complexity and nature of the data. Although several algorithms for the induction of classification models have been proposed, the algorithms are data-dependent. Consequently, we developed OmniGA, a novel and generalizable framework that uses different classification models in a tree-like decision structure, along with a parallel GA for the optimization of the OmniGA structure. Results show that OmniGA consistently outperformed existing commonly used classification models. The second challenge is the prediction of translation initiation sites (TIS) in plant genomic DNA. We performed a statistical analysis of the genomic DNA and proposed a new set of discriminant features for this problem. 
We developed a wrapper method based on GAs for selecting an optimal feature subset, which, in conjunction with a classification model, produced the most accurate framework for the recognition of TIS in plants. The results demonstrate that despite the evolutionary distance between different plants, our approach successfully identified conserved genomic elements that may serve as the starting point for the development of a generic model for prediction of TIS in eukaryotic organisms. Finally, the third challenge is the accurate prediction of polyadenylation signals in human genomic DNA. To achieve this, we analyzed genomic DNA sequences for the 12 most frequent polyadenylation signal variants and proposed a new set of features that may contribute to the understanding of the polyadenylation process. We derived Omni-PolyA, a model and tool based on OmniGA for the prediction of the polyadenylation signals. Results show that Omni-PolyA significantly reduced the average classification error rate compared to the state-of-the-art results. 2017-05-04T06:23:53Z 2018-05-04T00:00:00Z 2017-04-29 Dissertation 10.25781/KAUST-DPDH3 http://hdl.handle.net/10754/623317 en 2018-05-04 At the time of archiving, the student author of this dissertation opted to temporarily restrict access to it. The full text of this dissertation became available to the public after the expiration of the embargo on 2018-05-04.
collection	NDLTD
language	en
sources	NDLTD
topic	omnivariate decision trees; Machine Learning; polyadenylation signals; Bioinformatics; translation initiation sites; Data Mining
spellingShingle	omnivariate decision trees Machine Learning polyadenylation signals Bioinformatics translation initiation sites Data Mining Magana-Mora, Arturo Genetic Algorithms for Optimization of Machine-learning Models and their Applications in Bioinformatics
description	Machine-learning (ML) techniques have been widely applied to solve different problems in biology. However, biological data are large and complex, which often results in extremely intricate ML models. Frequently, these models perform poorly or are computationally infeasible. This study presents a set of novel computational methods and focuses on the application of genetic algorithms (GAs) for the simplification and optimization of ML models and their applications to biological problems. The dissertation addresses the following three challenges. The first is to develop a generalizable classification methodology able to systematically derive competitive models despite the complexity and nature of the data. Although several algorithms for the induction of classification models have been proposed, the algorithms are data-dependent. Consequently, we developed OmniGA, a novel and generalizable framework that uses different classification models in a tree-like decision structure, along with a parallel GA for the optimization of the OmniGA structure. Results show that OmniGA consistently outperformed existing commonly used classification models. The second challenge is the prediction of translation initiation sites (TIS) in plant genomic DNA. We performed a statistical analysis of the genomic DNA and proposed a new set of discriminant features for this problem. We developed a wrapper method based on GAs for selecting an optimal feature subset, which, in conjunction with a classification model, produced the most accurate framework for the recognition of TIS in plants. The results demonstrate that despite the evolutionary distance between different plants, our approach successfully identified conserved genomic elements that may serve as the starting point for the development of a generic model for prediction of TIS in eukaryotic organisms. Finally, the third challenge is the accurate prediction of polyadenylation signals in human genomic DNA. 
To achieve this, we analyzed genomic DNA sequences for the 12 most frequent polyadenylation signal variants and proposed a new set of features that may contribute to the understanding of the polyadenylation process. We derived Omni-PolyA, a model and tool based on OmniGA for the prediction of the polyadenylation signals. Results show that Omni-PolyA significantly reduced the average classification error rate compared to the state-of-the-art results.
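The description mentions a GA-based wrapper method for feature subset selection. The sketch below illustrates the general idea only and is not the dissertation's implementation: the nearest-centroid classifier, the synthetic data, the per-feature penalty, and every name (`ga_select_features`, `nearest_centroid_accuracy`, `fitness`, `N_FEATURES`) are assumptions made for illustration. Each candidate solution is a boolean mask over the features, and its fitness is the holdout accuracy of a classifier trained on the masked features.

```python
import random

def nearest_centroid_accuracy(X_train, y_train, X_test, y_test, mask):
    """Holdout accuracy of a nearest-centroid classifier using only masked-in features."""
    cols = [j for j, keep in enumerate(mask) if keep]
    if not cols:
        return 0.0  # an empty feature subset cannot classify anything
    centroids = {}
    for label in set(y_train):
        rows = [x for x, y in zip(X_train, y_train) if y == label]
        centroids[label] = [sum(r[j] for r in rows) / len(rows) for j in cols]
    correct = 0
    for x, y in zip(X_test, y_test):
        pred = min(
            centroids,
            key=lambda c: sum((x[j] - m) ** 2 for j, m in zip(cols, centroids[c])),
        )
        correct += pred == y
    return correct / len(y_test)

def ga_select_features(fitness, n_features, pop_size=20, generations=30,
                       mutation_rate=0.05, seed=0):
    """Simple generational GA over boolean feature masks (hypothetical helper)."""
    rng = random.Random(seed)
    population = [[rng.random() < 0.5 for _ in range(n_features)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        def tournament():
            # Binary tournament: the fitter of two random individuals wins.
            a, b = rng.sample(population, 2)
            return a if fitness(a) >= fitness(b) else b
        children = [best[:]]  # elitism: carry the best mask forward unchanged
        while len(children) < pop_size:
            p1, p2 = tournament(), tournament()
            cut = rng.randrange(1, n_features)  # one-point crossover
            child = p1[:cut] + p2[cut:]
            # Bit-flip mutation with a small per-gene probability.
            child = [(not g) if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        population = children
        best = max(population, key=fitness)
    return best

# Synthetic data (assumption): feature 0 carries the class signal, the rest are noise.
N_FEATURES = 8
data_rng = random.Random(1)
X, y = [], []
for i in range(200):
    label = i % 2
    X.append([label * 2.0 + data_rng.gauss(0, 0.5)]
             + [data_rng.gauss(0, 3.0) for _ in range(N_FEATURES - 1)])
    y.append(label)
X_train, y_train, X_test, y_test = X[:140], y[:140], X[140:], y[140:]

def fitness(mask):
    # Wrapper fitness: holdout accuracy minus a small penalty per selected feature.
    acc = nearest_centroid_accuracy(X_train, y_train, X_test, y_test, mask)
    return acc - 0.01 * sum(mask)

best = ga_select_features(fitness, N_FEATURES)
print("selected features:", [j for j, keep in enumerate(best) if keep])
```

In a real wrapper setup the fitness would more likely be a cross-validated score of the actual classification model, and fitness values would be cached, since each evaluation retrains the classifier.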
author2	Bajic, Vladimir B.
author_facet	Bajic, Vladimir B. Magana-Mora, Arturo
author	Magana-Mora, Arturo
author_sort	Magana-Mora, Arturo
title	Genetic Algorithms for Optimization of Machine-learning Models and their Applications in Bioinformatics
publishDate	2017
url	http://hdl.handle.net/10754/623317

