Using physicochemical and compositional characteristics of DNA sequence for prediction of genomic signals
The challenge of finding genes in eukaryotic organisms using computational methods is an ongoing problem in biology. Based on various genomic signals found in eukaryotic genomes, this problem can be divided into many different sub-problems such as identification of transcription start sites, tr...
Main Author: | |
---|---|
Other Authors: | |
Language: | en |
Published: | 2014 |
Subjects: | |
Online Access: | Mulamba, P. A. (2014). Using physicochemical and compositional characteristics of DNA sequence for prediction of genomic signals. KAUST Research Repository. https://doi.org/10.25781/KAUST-VM9KK http://hdl.handle.net/10754/336791 |
Summary: | The challenge of finding genes in eukaryotic organisms using computational methods is an ongoing problem in biology. Based on the various genomic signals found in eukaryotic genomes, this problem can be divided into many sub-problems, such as the identification of transcription start sites, translation initiation sites, splice sites, poly(A) signals, etc. Each sub-problem deals with a particular type of genomic signal, and various computational methods are used to solve each sub-problem. Aggregating information from all these individual sub-problems can lead to a complete annotation of a gene and its component signals. The fundamental principle of most of these computational methods is the mapping principle: building an input-output model for the prediction of a particular genomic signal based on a set of known input signals and their corresponding output signal. The type of input signals used to build the model is an essential element in most of these computational methods. The common factor of most of these methods is that they are mainly based on the statistical analysis of the basic nucleotide sequence string composition. Our study is based on a novel approach to predicting genomic signals in which uniquely generated structural profiles that combine compressed physicochemical properties with topological and compositional properties of DNA sequences are used to develop machine learning predictive models. The compression of the physicochemical properties is performed using a principal component analysis transformation. Our ideas are evaluated through prediction models of canonical splice sites using support vector machine models. We demonstrate across several species that the proposed methodology has resulted in the most accurate splice site predictors that are publicly available or described. We believe that the approach in this study is quite general and has various applications in other biological modeling problems. |
---|
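
The summary describes compressing physicochemical properties of DNA with principal component analysis and using the result as a structural profile for a machine learning model. The record does not give the details, so the following is only a minimal sketch of that general idea under stated assumptions: properties are assumed to be tabulated per dinucleotide, only the first principal component is kept, and the property values are random placeholders rather than measured data. All function names are hypothetical, and the support vector machine step is not shown.

```python
import itertools
import random

# All 16 dinucleotides; assuming properties are tabulated per dinucleotide.
DINUCLEOTIDES = ["".join(p) for p in itertools.product("ACGT", repeat=2)]


def make_placeholder_table(n_properties=6, seed=0):
    """Hypothetical property table: random numbers, NOT measured values."""
    rng = random.Random(seed)
    return {d: [rng.gauss(0.0, 1.0) for _ in range(n_properties)]
            for d in DINUCLEOTIDES}


def center(rows):
    """Subtract the column mean from every column."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def first_component(rows, iterations=200):
    """Leading principal axis of centered rows, found by power iteration
    on the sample covariance matrix."""
    dim = len(rows[0])
    cov = [[sum(r[i] * r[j] for r in rows) / (len(rows) - 1)
            for j in range(dim)] for i in range(dim)]
    vec = [1.0] * dim
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(dim)) for i in range(dim)]
        norm = sum(x * x for x in nxt) ** 0.5
        vec = [x / norm for x in nxt]
    return vec


def compress_table(table):
    """Project each dinucleotide's property vector onto the first
    principal axis, giving one compressed score per dinucleotide."""
    keys = sorted(table)
    centered = center([table[k] for k in keys])
    axis = first_component(centered)
    return {k: sum(x * a for x, a in zip(row, axis))
            for k, row in zip(keys, centered)}


def structural_profile(sequence, scores):
    """Slide over the sequence and emit one compressed score per
    overlapping dinucleotide."""
    seq = sequence.upper()
    return [scores[seq[i:i + 2]] for i in range(len(seq) - 1)]


if __name__ == "__main__":
    scores = compress_table(make_placeholder_table())
    print(structural_profile("ACGTAGGTAAGT", scores))
```

In a pipeline like the one the summary outlines, profiles of this kind, computed over fixed-length windows around candidate splice sites, would form part of the feature vector passed to a classifier such as a support vector machine.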