Using physicochemical and compositional characteristics of DNA sequence for prediction of genomic signals

The challenge of finding genes in eukaryotic organisms using computational methods is an ongoing problem in biology. Based on the various genomic signals found in eukaryotic genomes, this problem can be divided into many sub-problems, such as the identification of transcription start sites, tr...


Bibliographic Details
Main Author: Mulamba, Pierre Abraham
Other Authors: Bajic, Vladimir B.
Language: en
Published: 2014
Subjects:
Online Access: Mulamba, P. A. (2014). Using physicochemical and compositional characteristics of DNA sequence for prediction of genomic signals. KAUST Research Repository. https://doi.org/10.25781/KAUST-VM9KK
http://hdl.handle.net/10754/336791
id ndltd-kaust.edu.sa-oai-repository.kaust.edu.sa-10754-336791
record_format oai_dc
spelling ndltd-kaust.edu.sa-oai-repository.kaust.edu.sa-10754-336791 2021-02-17T05:08:54Z Using physicochemical and compositional characteristics of DNA sequence for prediction of genomic signals Mulamba, Pierre Abraham Bajic, Vladimir B. Biological and Environmental Sciences and Engineering (BESE) Division Moshkov, Mikhail Arold, Stefan T. Christoffels, Alan Physicochemical Compositional Characteristics Prediction Genomic Signals The challenge of finding genes in eukaryotic organisms using computational methods is an ongoing problem in biology. Based on the various genomic signals found in eukaryotic genomes, this problem can be divided into many sub-problems, such as the identification of transcription start sites, translation initiation sites, splice sites, poly(A) signals, etc. Each sub-problem deals with a particular type of genomic signal, and various computational methods are used to solve each one. Aggregating information from all these individual sub-problems can lead to a complete annotation of a gene and its component signals. The fundamental principle of most of these computational methods is the mapping principle: building an input-output model for the prediction of a particular genomic signal based on a set of known input signals and their corresponding output signal. The type of input signals used to build the model is an essential element in most of these computational methods. The common factor of most of these methods is that they are mainly based on the statistical analysis of the basic nucleotide sequence string composition. Our study is based on a novel approach to predicting genomic signals, in which uniquely generated structural profiles that combine compressed physicochemical properties with topological and compositional properties of DNA sequences are used to develop machine learning predictive models. The physicochemical properties are compressed using a principal component analysis transformation. Our ideas are evaluated through prediction models of canonical splice sites using support vector machine models. We demonstrate across several species that the proposed methodology has resulted in the most accurate splice site predictors that are publicly available or described. We believe that the approach in this study is quite general and has various applications in other biological modeling problems. 2014-12-07T13:52:27Z 2015-12-07T00:00:00Z 2014-12 Dissertation Mulamba, P. A. (2014). Using physicochemical and compositional characteristics of DNA sequence for prediction of genomic signals. KAUST Research Repository. https://doi.org/10.25781/KAUST-VM9KK 10.25781/KAUST-VM9KK http://hdl.handle.net/10754/336791 en 2015-12-07 At the time of archiving, the student author of this dissertation opted to temporarily restrict access to it. The full text of this dissertation became available to the public after the expiration of the embargo on 2015-12-07.
collection NDLTD
language en
sources NDLTD
topic Physicochemical
Compositional
Characteristics
Prediction
Genomic
Signals
description The challenge of finding genes in eukaryotic organisms using computational methods is an ongoing problem in biology. Based on the various genomic signals found in eukaryotic genomes, this problem can be divided into many sub-problems, such as the identification of transcription start sites, translation initiation sites, splice sites, poly(A) signals, etc. Each sub-problem deals with a particular type of genomic signal, and various computational methods are used to solve each one. Aggregating information from all these individual sub-problems can lead to a complete annotation of a gene and its component signals. The fundamental principle of most of these computational methods is the mapping principle: building an input-output model for the prediction of a particular genomic signal based on a set of known input signals and their corresponding output signal. The type of input signals used to build the model is an essential element in most of these computational methods. The common factor of most of these methods is that they are mainly based on the statistical analysis of the basic nucleotide sequence string composition. Our study is based on a novel approach to predicting genomic signals, in which uniquely generated structural profiles that combine compressed physicochemical properties with topological and compositional properties of DNA sequences are used to develop machine learning predictive models. The physicochemical properties are compressed using a principal component analysis transformation. Our ideas are evaluated through prediction models of canonical splice sites using support vector machine models. We demonstrate across several species that the proposed methodology has resulted in the most accurate splice site predictors that are publicly available or described. We believe that the approach in this study is quite general and has various applications in other biological modeling problems.
author2 Bajic, Vladimir B.
author_facet Bajic, Vladimir B.
Mulamba, Pierre Abraham
author Mulamba, Pierre Abraham
author_sort Mulamba, Pierre Abraham
title Using physicochemical and compositional characteristics of DNA sequence for prediction of genomic signals
title_sort using physicochemical and compositional characteristics of dna sequence for prediction of genomic signals
publishDate 2014
url Mulamba, P. A. (2014). Using physicochemical and compositional characteristics of DNA sequence for prediction of genomic signals. KAUST Research Repository. https://doi.org/10.25781/KAUST-VM9KK
http://hdl.handle.net/10754/336791
work_keys_str_mv AT mulambapierreabraham usingphysicochemicalandcompositionalcharacteristicsofdnasequenceforpredictionofgenomicsignals
_version_ 1719377401597657088