A Protein Sequence Clustering Algorithm Based on Statistical Models

Master's === National Taiwan University === Graduate Institute of Computer Science and Information Engineering === 92 === Protein sequence clustering can group the homologous proteins together based on pair-wise sequence similarities. The conventional single-linkage clustering algorithm has been widely used on this problem because it successfully utilizes the transitivity property...


Bibliographic Details
Main Authors: Wen-Chin Chung, 鐘文欽
Other Authors: Yen-Jen Oyang
Format: Others
Language: zh-TW
Published: 2004
Online Access: http://ndltd.ncl.edu.tw/handle/53565740661331487469
id ndltd-TW-092NTU05392040
record_format oai_dc
spelling ndltd-TW-092NTU05392040 2016-06-10T04:15:43Z http://ndltd.ncl.edu.tw/handle/53565740661331487469 A Protein Sequence Clustering Algorithm Based on Statistical Models 以統計模型為基礎之複合式蛋白質序列分群演算法 Wen-Chin Chung 鐘文欽 Master's National Taiwan University Graduate Institute of Computer Science and Information Engineering 92 Yen-Jen Oyang 歐陽彥正 2004 thesis 39 zh-TW
collection NDLTD
sources NDLTD
description Master's === National Taiwan University === Graduate Institute of Computer Science and Information Engineering === 92 === Protein sequence clustering groups homologous proteins together based on pair-wise sequence similarities. The conventional single-linkage clustering algorithm has been widely used for this problem because it exploits the transitivity property to identify remote homologues and provides a dendrogram as the clustering result, which is useful for protein family analysis. However, due to the twilight zone embedded in the distribution of pair-wise similarities, the single-linkage algorithm sometimes generates clusters with low sensitivity for large families or for families with noisy relationships to members of other protein families. In this thesis, a hybrid hierarchical clustering algorithm is proposed to improve the quality of the dendrogram generated by the single-linkage clustering algorithm. First, by creating pair clusters, the algorithm allows a single protein to exist in distinct hierarchical paths of the dendrogram. Next, the proposed algorithm employs the skewness and kurtosis indices to control the formation of subclusters, in order to generate highly homologous clusters at the bottom level of the dendrogram. Finally, selecting pivots of each subcluster for the subsequent clustering process avoids the chaining effect that the single-linkage algorithm might cause. Thus the proposed algorithm can produce clusters with both high sensitivity and high specificity at the higher level of the dendrogram. The experimental results in this thesis show that the hierarchy produced by the proposed algorithm matches the hierarchy of protein families better than the hierarchy generated by the single-linkage algorithm. In this regard, the generated hierarchy can provide automatic annotations for new proteins with higher accuracy than previous approaches.
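The abstract names its building blocks (single-linkage clustering, the chaining effect, and skewness and kurtosis indices) without giving formulas or thresholds. As a rough illustration only, the Python sketch below shows textbook versions of those pieces: a single-linkage cut at one similarity threshold, and the standard moment-based skewness and excess kurtosis of a list of similarity scores. The function names, the toy similarity scores, and the thresholds are all hypothetical, and nothing here reproduces the thesis's hybrid algorithm, its pair clusters, or its pivot selection.

```python
def single_linkage(n, similarities, threshold):
    """Group items 0..n-1 by merging any two clusters linked by at least one
    pair whose similarity is >= threshold (single linkage at one cut level).
    `similarities` maps (i, j) index pairs to a score; higher means closer."""
    parent = list(range(n))

    def find(x):
        # Follow parent links to the cluster representative (path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), score in similarities.items():
        if score >= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    clusters = {}
    for item in range(n):
        clusters.setdefault(find(item), []).append(item)
    return sorted(clusters.values())


def _central_moment(values, order):
    mean = sum(values) / len(values)
    return sum((v - mean) ** order for v in values) / len(values)


def skewness(values):
    """Population skewness m3 / m2^1.5 (assumes the values are not all equal)."""
    return _central_moment(values, 3) / _central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """Population excess kurtosis m4 / m2^2 - 3 (assumes non-constant values)."""
    return _central_moment(values, 4) / _central_moment(values, 2) ** 2 - 3.0


# Hypothetical toy data: two tight groups {0, 1, 2} and {3, 4} joined only by
# one weak, twilight-zone-like link between items 2 and 3.
toy_similarities = {
    (0, 1): 0.90,
    (1, 2): 0.80,
    (2, 3): 0.35,
    (3, 4): 0.85,
    (0, 4): 0.05,
}

loose = single_linkage(5, toy_similarities, threshold=0.3)
strict = single_linkage(5, toy_similarities, threshold=0.5)
print(loose)   # the weak 2-3 link chains everything into one cluster
print(strict)  # the two tight groups stay separate
print(skewness([0.90, 0.85, 0.80, 0.10]))  # negative: one low outlier score
```

With the loose threshold, the single weak link is enough to chain both groups into one cluster, which is the chaining effect the abstract refers to. A strongly skewed or heavy-tailed distribution of within-cluster similarities is one plausible signal that a cluster mixes unrelated members, though how the thesis actually applies the two indices is not stated in this record.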