A Study of Virus Classification via Genomic DNA Sequences

Bibliographic Details
Main Authors: Jig-Fu Huang, 黃進福
Other Authors: Jing-Doo Wang
Format: Others
Language: zh-TW
Published: 2011
Online Access: http://ndltd.ncl.edu.tw/handle/57017642093554323280
Description
Summary: Master's === Asia University === Master's Program, Department of Bioinformatics === 99 === The availability of virus genome sequences provides a new approach to virus classification from the viewpoint of molecular biology, rather than traditional morphology. To use the classifiers available in the vector space model, the virus instances (genomic sequences) must first be transformed into representative vectors. In this study, we adopted the k-mer approach for pattern extraction and used the entropy of the pattern distribution for pattern weighting. To examine how coding and non-coding regions within a DNA nucleotide sequence differ in effectiveness, four types of sequences, "ALL", "Coding", "NonCoding", and "DirectedCoding", were extracted separately as input for classification comparison. The viral genomes were downloaded from NCBI and comprised 22 virus families consisting of 1,601 virus species. Values of k from 1 to 6 were evaluated. The results showed that the highest accuracy achieved by the well-known SVM classifier was 95.6%, using sequences of type "ALL" with k = 5. Furthermore, the accuracy achieved with "DirectedCoding" was higher than that achieved with "Coding". Unexpectedly, the accuracy achieved with the "NonCoding" sequence type was as high as 90% when k = 6. This observation suggests that some information is conserved in non-coding regions and is worth further investigation by biologists.
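
The vectorization step described in the summary (k-mer pattern extraction followed by entropy-based pattern weighting) can be illustrated with a minimal Python sketch. The record does not give the exact weighting formula, so the weight used here, one minus the normalized entropy of a k-mer's distribution across classes, is an assumption, and the function names kmer_counts, entropy_weights, and to_vector are hypothetical. The resulting vectors could then be passed to any vector space classifier such as an SVM.

```python
import math
from collections import Counter
from itertools import product

ALPHABET = "ACGT"


def kmer_counts(seq, k):
    """Count overlapping k-mers; windows containing symbols other than A/C/G/T are skipped."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if all(ch in ALPHABET for ch in kmer):
            counts[kmer] += 1
    return counts


def entropy_weights(class_counts, k):
    """Weight each k-mer by 1 - H(p) / log(n_classes).

    ASSUMPTION: this formula is illustrative, not the thesis's definition.
    p is the k-mer's relative frequency in each class, renormalized to sum to 1.
    A k-mer concentrated in one class gets weight 1; an evenly spread one gets 0.
    """
    classes = list(class_counts)
    n = len(classes)
    totals = {c: sum(class_counts[c].values()) for c in classes}
    weights = {}
    for kmer in map("".join, product(ALPHABET, repeat=k)):
        freqs = [class_counts[c][kmer] / totals[c] if totals[c] else 0.0
                 for c in classes]
        s = sum(freqs)
        if s == 0 or n < 2:
            weights[kmer] = 0.0  # unseen k-mer, or entropy undefined for one class
            continue
        h = -sum((f / s) * math.log(f / s) for f in freqs if f > 0)
        weights[kmer] = 1.0 - h / math.log(n)
    return weights


def to_vector(seq, k, weights):
    """Map a sequence to a 4**k vector of weighted relative k-mer frequencies."""
    counts = kmer_counts(seq, k)
    total = sum(counts.values()) or 1
    return [weights[kmer] * counts[kmer] / total
            for kmer in map("".join, product(ALPHABET, repeat=k))]
```

For example, with per-family counts gathered as class_counts = {"famA": kmer_counts(seq_a, 5), "famB": kmer_counts(seq_b, 5)}, calling to_vector(seq, 5, entropy_weights(class_counts, 5)) yields a vector of length 4^5 = 1024. The family labels and sequence variables here are placeholders.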