Summary: | Identification of non-protein-coding RNAs (ncRNAs) in genomes is a crucial task for not only molecular cell biology but also bioinformatics. Secondary structures of ncRNAs are employed as a key feature of ncRNA analysis since biological functions of ncRNAs are deeply related to their secondary structures. Although the minimum free energy (MFE) structure of an RNA sequence is regarded as the most stable structure, MFE alone is not an appropriate measure for identifying ncRNAs since the free energy is heavily biased by the nucleotide composition. Therefore, instead of MFE itself, several alternative measures for identifying ncRNAs have been proposed, such as the structure conservation index (SCI) and the base pair distance (BPD), both of which employ MFE structures. However, these measures are not suitable for identifying ncRNAs in some cases, including genome-wide searches, and incur a high false discovery rate. In this study, we propose improved measures based on SCI and BPD, applying generalized centroid estimators to incorporate robustness against low-quality multiple alignments. Our experiments show that the proposed methods achieve higher accuracy than the original SCI and BPD not only for human-curated structural alignments but also for low-quality alignments produced by CLUSTAL W. Furthermore, the centroid-based SCI on CLUSTAL W alignments is more accurate than or comparable with the original SCI on structural alignments generated with RAF, a high-quality structural aligner that requires twice the computational time on average. We conclude that our methods are more suitable than the original SCI and BPD for genome-wide alignments, which are of low quality from the point of view of secondary structures.
|