Summary: | Master's === Asia University === Master's Program, Department of Bioinformatics === 99 === The availability of virus genome sequences nowadays
provides a new approach to virus classification from the viewpoint of
molecular biology, instead of from traditional morphology.
To use the classifiers available in the vector space model, it is
important to transform virus instances into representative vectors.
To transform virus instances (genomic sequences) into vectors
as input for the classification experiments, in this study we
adopted the k-mer approach for pattern extraction and used the
entropy of the pattern distribution for pattern weighting. To inspect the
different effectiveness of coding and non-coding regions within a DNA
nucleotide sequence, four types of sequences, "ALL", "Coding",
"NonCoding", and "DirectedCoding", were extracted individually as the
input for classification comparison. Experimental resources of viral
genomes were downloaded from the NCBI and included 22 virus families
consisting of 1,601 virus species. Values of k
ranging from 1 to 6 were evaluated in the experiments. The results showed
that the highest accuracy achieved by the well-known SVM classifier was
95.6%, obtained using sequences of type "ALL" when k = 5.
Furthermore, the accuracy achieved with "DirectedCoding" was higher than
that achieved with "Coding". It was contrary to our expectation that
the accuracy achieved using the "NonCoding" sequence type
was as high as 90% when k = 6. This observation suggests that
some information conserved in non-coding regions is worthy
of further investigation by biologists.
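
The k-mer extraction and entropy-based weighting described above can be illustrated with a minimal sketch. The abstract does not give the exact weighting formula, so the one used here (one minus the normalized entropy of a k-mer's distribution over virus families) is an assumption, and the function names `kmer_counts`, `entropy_weights`, and `to_vector` are hypothetical.

```python
import math
from collections import Counter
from itertools import product

ALPHABET = "ACGT"

def kmer_counts(seq, k):
    """Count overlapping k-mers; windows containing non-ACGT symbols are skipped."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set(ALPHABET):
            counts[kmer] += 1
    return counts

def entropy_weights(class_counts, k):
    """Assumed weighting (not confirmed by the source): 1 - H / log(number of classes),
    where H is the entropy of a k-mer's occurrence distribution over the classes.
    class_counts maps a class label (e.g. a virus family) to its pooled k-mer Counter."""
    n_classes = len(class_counts)
    weights = {}
    for kmer in ("".join(p) for p in product(ALPHABET, repeat=k)):
        freqs = [counts[kmer] for counts in class_counts.values()]
        total = sum(freqs)
        if total == 0 or n_classes < 2:
            weights[kmer] = 0.0  # unseen k-mer, or no classes to discriminate
            continue
        h = -sum((f / total) * math.log(f / total) for f in freqs if f > 0)
        weights[kmer] = 1.0 - h / math.log(n_classes)
    return weights

def to_vector(seq, k, weights):
    """Weighted relative k-mer frequencies, in lexicographic k-mer order."""
    counts = kmer_counts(seq, k)
    total = sum(counts.values()) or 1
    return [weights[kmer] * counts[kmer] / total for kmer in sorted(weights)]
```

Under this assumed scheme, a k-mer that occurs in only one family receives weight 1, while a k-mer spread evenly over all families receives weight 0. The resulting vectors have 4^k dimensions and could be passed to any vector-space classifier such as an SVM.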
|