Summary: | <p>Abstract</p> <p>Background</p> <p>The exact origin of the cause of the Severe Acute Respiratory Syndrome (SARS) is still an open question. The genomic sequence relationship of SARS-CoV with 30 different single-stranded RNA (<it>ss</it>RNA) viruses of various families was studied using two non-standard approaches. Both approaches began with the vectorial profiling of the tetra-nucleotide usage pattern <it>V </it>for each virus. In approach one, a distance measure of a vector <it>V</it>, based on correlation coefficient was devised to construct a relationship tree by the neighbor-joining algorithm. In approach two, a multivariate factor analysis was performed to derive the embedded tetra-nucleotide usage patterns. These patterns were subsequently used to classify the selected viruses.</p> <p>Results</p> <p>Both approaches yielded relationship outcomes that are consistent with the known virus classification. They also indicated that the genome of RNA viruses from the same family conform to a specific pattern of word usage. Based on the correlation of the overall tetra-nucleotide usage patterns, the Transmissible Gastroenteritis Virus (TGV) and the Feline CoronaVirus (FCoV) are closest to SARS-CoV. Surprisingly also, the RNA viruses that do not go through a DNA stage displayed a remarkable discrimination against the CpG and UpA di-nucleotide (<it>z </it>= -77.31, -52.48 respectively) and selection for UpG and CpA (<it>z </it>= 65.79,49.99 respectively). Potential factors influencing these biases are discussed.</p> <p>Conclusion</p> <p>The study of genomic word usage is a powerful method to classify RNA viruses. The congruence of the relationship outcomes with the known classification indicates that there exist phylogenetic signals in the tetra-nucleotide usage patterns, that is most prominent in the replicase open reading frames.</p>
|