Summary: | Master's === National Cheng Kung University === Institute of Information Management === 100 === Operational Taxonomic Unit (OTU) analysis is essential to metagenomics. It obtains gene sequence data in a variety of ways and applies sequence alignment and clustering methods to identify the gene clusters that can affect interactions between microbes and their ecological environment. Recent studies use short, highly variable gene sequences to reduce cost. Because the number of sequences is huge, computationally efficient clustering methods have been developed to find stable clustering results. The alignment method, which brings all sequences to the same length for similarity calculation, can also affect clustering results. Many clustering methods have therefore been developed for OTU analysis, and most of them are evaluated by their computational efficiency and the number of clusters they produce. No measures have been established to objectively evaluate the clustering results of gene sequence data. We first test the validity of an index proposed in a recent study and find that it is not effective for evaluating the clustering results of gene sequence data. We therefore design indexes based on two different concepts, supervised and unsupervised, to evaluate the performance of five clustering methods applied to gene sequence data. Experimental results on two gene sequence data sets show that, at any given level of the taxonomy, the appropriate threshold differs from one clustering method to another. The evaluations given by the supervised and unsupervised indexes proposed in this study are not consistent, and the supervised index is the more reliable of the two.
|