Using physicochemical and compositional characteristics of DNA sequence for prediction of genomic signals
Finding genes in eukaryotic organisms using computational methods is an ongoing problem in biology. Based on the various genomic signals found in eukaryotic genomes, this problem can be divided into many sub-problems, such as the identification of transcription start sites, tr...
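Main Author: | Mulamba, Pierre Abraham |
---|---|
Other Authors: | Bajic, Vladimir B. |
Language: | en |
Published: | 2014 |
Subjects: | Physicochemical Compositional Characteristics Prediction Genomic Signals |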
Online Access: | Mulamba, P. A. (2014). Using physicochemical and compositional characteristics of DNA sequence for prediction of genomic signals. KAUST Research Repository. https://doi.org/10.25781/KAUST-VM9KK http://hdl.handle.net/10754/336791 |
id |
ndltd-kaust.edu.sa-oai-repository.kaust.edu.sa-10754-336791 |
---|---|
record_format |
oai_dc |
spelling |
ndltd-kaust.edu.sa-oai-repository.kaust.edu.sa-10754-336791 2021-02-17T05:08:54Z Using physicochemical and compositional characteristics of DNA sequence for prediction of genomic signals Mulamba, Pierre Abraham Bajic, Vladimir B. Biological and Environmental Sciences and Engineering (BESE) Division Moshkov, Mikhail Arold, Stefan T. Christoffels, Alan Physicochemical Compositional Characteristics Prediction Genomic Signals Finding genes in eukaryotic organisms using computational methods is an ongoing problem in biology. Based on the various genomic signals found in eukaryotic genomes, this problem can be divided into many sub-problems, such as the identification of transcription start sites, translation initiation sites, splice sites, poly(A) signals, etc. Each sub-problem deals with a particular type of genomic signal, and various computational methods are used to solve each sub-problem. Aggregating information from all these individual sub-problems can lead to a complete annotation of a gene and its component signals. The fundamental principle of most of these computational methods is the mapping principle: building an input-output model for the prediction of a particular genomic signal based on a set of known input signals and their corresponding output signal. The type of input signals used to build the model is an essential element in most of these computational methods. The common factor of most of these methods is that they are mainly based on the statistical analysis of the basic nucleotide sequence string composition. Our study is based on a novel approach to predict genomic signals in which uniquely generated structural profiles that combine compressed physicochemical properties with topological and compositional properties of DNA sequences are used to develop machine learning predictive models. The compression of the physicochemical properties is performed using a principal component analysis transformation. 
Our ideas are evaluated through prediction models of canonical splice sites using support vector machine models. We demonstrate across several species that the proposed methodology has resulted in the most accurate splice site predictors that are publicly available or described. We believe that the approach in this study is quite general and has various applications in other biological modeling problems. 2014-12-07T13:52:27Z 2015-12-07T00:00:00Z 2014-12 Dissertation Mulamba, P. A. (2014). Using physicochemical and compositional characteristics of DNA sequence for prediction of genomic signals. KAUST Research Repository. https://doi.org/10.25781/KAUST-VM9KK 10.25781/KAUST-VM9KK http://hdl.handle.net/10754/336791 en 2015-12-07 At the time of archiving, the student author of this dissertation opted to temporarily restrict access to it. The full text of this dissertation became available to the public after the expiration of the embargo on 2015-12-07. |
collection |
NDLTD |
language |
en |
sources |
NDLTD |
topic |
Physicochemical Compositional Characteristics Prediction Genomic Signals |
description |
Finding genes in eukaryotic organisms using computational methods is an ongoing problem in biology. Based on the various genomic signals found in eukaryotic genomes, this problem can be divided into many sub-problems, such as the identification of transcription start sites, translation initiation sites, splice sites, poly(A) signals, etc. Each sub-problem deals with a particular type of genomic signal, and various computational methods are used to solve each sub-problem. Aggregating information from all these individual sub-problems can lead to a complete annotation of a gene and its component signals. The fundamental principle of most of these computational methods is the mapping principle: building an input-output model for the prediction of a particular genomic signal based on a set of known input signals and their corresponding output signal. The type of input signals used to build the model is an essential element in most of these computational methods. The common factor of most of these methods is that they are mainly based on the statistical analysis of the basic nucleotide sequence string composition. Our study is based on a novel approach to predict genomic signals in which uniquely generated structural profiles that combine compressed physicochemical properties with topological and compositional properties of DNA sequences are used to develop machine learning predictive models. The compression of the physicochemical properties is performed using a principal component analysis transformation. Our ideas are evaluated through prediction models of canonical splice sites using support vector machine models. We demonstrate across several species that the proposed methodology has resulted in the most accurate splice site predictors that are publicly available or described. We believe that the approach in this study is quite general and has various applications in other biological modeling problems. |
author2 |
Bajic, Vladimir B. |
author_facet |
Bajic, Vladimir B. Mulamba, Pierre Abraham |
author |
Mulamba, Pierre Abraham |
title |
Using physicochemical and compositional characteristics of DNA sequence for prediction of genomic signals |
publishDate |
2014 |
url |
Mulamba, P. A. (2014). Using physicochemical and compositional characteristics of DNA sequence for prediction of genomic signals. KAUST Research Repository. https://doi.org/10.25781/KAUST-VM9KK http://hdl.handle.net/10754/336791 |
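
The description above mentions compressing physicochemical properties with a principal component analysis transformation before building predictive models. As a rough illustration only, the sketch below compresses a per-dinucleotide property table with PCA and maps a sequence onto the compressed values. The record does not state which properties, how many components, or which sequence encoding the dissertation uses, so the dinucleotide encoding, the 12-column table of random placeholder values, the choice of 3 components, and all function names here are assumptions, not the author's method.

```python
import numpy as np

# All 16 dinucleotides in a fixed order (assumed encoding unit, not from the record).
DINUCLEOTIDES = [a + b for a in "ACGT" for b in "ACGT"]


def make_placeholder_property_table(n_properties=12, seed=0):
    """Return a (16, n_properties) table of random stand-in values.

    These are NOT real physicochemical measurements; they only give the
    sketch something to compress.
    """
    rng = np.random.default_rng(seed)
    return rng.normal(size=(len(DINUCLEOTIDES), n_properties))


def pca_compress(table, n_components=3):
    """Project the rows of `table` onto their first principal components."""
    centered = table - table.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


def sequence_profile(seq, compressed):
    """Map each overlapping dinucleotide of `seq` to its compressed vector."""
    index = {d: i for i, d in enumerate(DINUCLEOTIDES)}
    steps = [seq[i:i + 2] for i in range(len(seq) - 1)]
    return np.array([compressed[index[s]] for s in steps])


table = make_placeholder_property_table()
compressed = pca_compress(table, n_components=3)
profile = sequence_profile("CAGGTAAGT", compressed)
print(profile.shape)
```

For the 9-base example sequence, `profile` has one row per overlapping dinucleotide and one column per retained component, so its shape is (8, 3). A profile like this could then be flattened and passed to a classifier such as a support vector machine, which is the model family the description names.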