A Protein Sequence Clustering Algorithm Based on Statistical Models

Master's === National Taiwan University === Graduate Institute of Computer Science and Information Engineering === 92 === Protein sequence clustering can group the homologous proteins together based on pair-wise sequence similarities. The conventional single-linkage clustering algorithm has been widely used on this problem because it successfully utilizes the transitivity property...


Bibliographic Details
Main Authors:	Wen-Chin Chung, 鐘文欽
Other Authors:	Yen-Jen Oyang
Format:	Others
Language:	zh-TW
Published:	2004
Online Access:	http://ndltd.ncl.edu.tw/handle/53565740661331487469

id	ndltd-TW-092NTU05392040
record_format	oai_dc
spelling	ndltd-TW-092NTU05392040 2016-06-10T04:15:43Z http://ndltd.ncl.edu.tw/handle/53565740661331487469 A Protein Sequence Clustering Algorithm Based on Statistical Models 以統計模型為基礎之複合式蛋白質序列分群演算法 Wen-Chin Chung 鐘文欽 Master's National Taiwan University Graduate Institute of Computer Science and Information Engineering 92 Protein sequence clustering can group the homologous proteins together based on pair-wise sequence similarities. The conventional single-linkage clustering algorithm has been widely used on this problem because it successfully utilizes the transitivity property to identify remote homologues and provides a dendrogram as the clustering result, which is useful for protein family analysis. However, due to the twilight zone embedded in the distribution of pair-wise similarities, the single-linkage algorithm sometimes generates clusters with low sensitivity for large families or families with noisy relationships to the members of other protein families. In this thesis, a hybrid hierarchical clustering algorithm is proposed to improve the quality of a dendrogram generated by the single-linkage clustering algorithm. By creating pair clusters, a single protein can exist in distinct hierarchical paths of a dendrogram. Next, the proposed algorithm employs the skewness and kurtosis indices to control the formation of subclusters, in order to generate highly homologous clusters at the bottom level of a dendrogram. Finally, selecting pivots of a subcluster in the following clustering process avoids the chaining effect that might be caused by the single-linkage algorithm. Thus, the proposed algorithm can produce clusters with both high sensitivity and specificity at the higher level of a dendrogram. The experimental results in this thesis showed that the hierarchy output by the proposed algorithm matches the hierarchy of protein families better than the hierarchy generated by the single-linkage algorithm. In this regard, the generated hierarchy can provide automatic annotations for new proteins with higher accuracy than previous approaches.
Yen-Jen Oyang 歐陽彥正 2004 學位論文 ; thesis 39 zh-TW
collection	NDLTD
language	zh-TW
format	Others
sources	NDLTD
description	Master's === National Taiwan University === Graduate Institute of Computer Science and Information Engineering === 92 === Protein sequence clustering can group the homologous proteins together based on pair-wise sequence similarities. The conventional single-linkage clustering algorithm has been widely used on this problem because it successfully utilizes the transitivity property to identify remote homologues and provides a dendrogram as the clustering result, which is useful for protein family analysis. However, due to the twilight zone embedded in the distribution of pair-wise similarities, the single-linkage algorithm sometimes generates clusters with low sensitivity for large families or families with noisy relationships to the members of other protein families. In this thesis, a hybrid hierarchical clustering algorithm is proposed to improve the quality of a dendrogram generated by the single-linkage clustering algorithm. By creating pair clusters, a single protein can exist in distinct hierarchical paths of a dendrogram. Next, the proposed algorithm employs the skewness and kurtosis indices to control the formation of subclusters, in order to generate highly homologous clusters at the bottom level of a dendrogram. Finally, selecting pivots of a subcluster in the following clustering process avoids the chaining effect that might be caused by the single-linkage algorithm. Thus, the proposed algorithm can produce clusters with both high sensitivity and specificity at the higher level of a dendrogram. The experimental results in this thesis showed that the hierarchy output by the proposed algorithm matches the hierarchy of protein families better than the hierarchy generated by the single-linkage algorithm. In this regard, the generated hierarchy can provide automatic annotations for new proteins with higher accuracy than previous approaches.
author2	Yen-Jen Oyang
author_facet	Yen-Jen Oyang Wen-Chin Chung 鐘文欽
author	Wen-Chin Chung 鐘文欽
spellingShingle	Wen-Chin Chung 鐘文欽 A Protein Sequence Clustering Algorithm Based on Statistical Models
author_sort	Wen-Chin Chung
title	A Protein Sequence Clustering Algorithm Based on Statistical Models
title_short	A Protein Sequence Clustering Algorithm Based on Statistical Models
title_full	A Protein Sequence Clustering Algorithm Based on Statistical Models
title_fullStr	A Protein Sequence Clustering Algorithm Based on Statistical Models
title_full_unstemmed	A Protein Sequence Clustering Algorithm Based on Statistical Models
title_sort	protein sequence clustering algorithm based on statistical models
publishDate	2004
url	http://ndltd.ncl.edu.tw/handle/53565740661331487469
work_keys_str_mv	AT wenchinchung aproteinsequenceclusteringalgorithmbasedonstatisticalmodels AT zhōngwénqīn aproteinsequenceclusteringalgorithmbasedonstatisticalmodels AT wenchinchung yǐtǒngjìmóxíngwèijīchǔzhīfùhéshìdànbáizhìxùlièfēnqúnyǎnsuànfǎ AT zhōngwénqīn yǐtǒngjìmóxíngwèijīchǔzhīfùhéshìdànbáizhìxùlièfēnqúnyǎnsuànfǎ AT wenchinchung proteinsequenceclusteringalgorithmbasedonstatisticalmodels AT zhōngwénqīn proteinsequenceclusteringalgorithmbasedonstatisticalmodels
_version_	1718299942058983424
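
Illustrative note: the abstract relies on two standard ingredients, single-linkage grouping over pair-wise similarities and the skewness and kurtosis of a similarity distribution. The Python sketch below shows only these textbook building blocks, not the thesis's hybrid algorithm. The protein labels, similarity scores, threshold, and function names are hypothetical, and the abstract does not say how the thesis applies the two indices, so the second function gives only the usual population-moment definitions.

```python
from itertools import combinations


def single_linkage(items, sim, threshold):
    """Flat single-linkage clusters: connected components of the graph
    whose edges are pairs with similarity >= threshold (union-find)."""
    parent = {x: x for x in items}

    def find(x):
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(items, 2):
        # Pairs missing from `sim` are treated as similarity 0.0 (assumption).
        if sim.get(frozenset((a, b)), 0.0) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for x in items:
        clusters.setdefault(find(x), []).append(x)
    return sorted(sorted(c) for c in clusters.values())


def skewness_kurtosis(values):
    """Population skewness and excess kurtosis from central moments.
    Assumes the values are not all identical (non-zero variance)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0


# Hypothetical toy scores: P1 and P3 are only weakly similar to each other,
# yet both are strongly similar to P2.
proteins = ["P1", "P2", "P3", "P4", "P5"]
scores = {
    frozenset(("P1", "P2")): 0.90,
    frozenset(("P2", "P3")): 0.80,
    frozenset(("P1", "P3")): 0.20,
    frozenset(("P3", "P4")): 0.10,
    frozenset(("P4", "P5")): 0.85,
}

print(single_linkage(proteins, scores, threshold=0.5))
print(skewness_kurtosis([1.0, 2.0, 3.0, 4.0, 5.0]))
```

In this toy data, P1 and P3 end up in the same cluster through P2 even though their direct similarity is low. That is the transitivity the abstract credits with finding remote homologues, and when the linking pairs are noisy the same mechanism produces the chaining effect it mentions.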
