A Protein Sequence Clustering Algorithm Based on Statistical Models
Master's === National Taiwan University === Graduate Institute of Computer Science and Information Engineering === 92 === Protein sequence clustering can group homologous proteins together based on pairwise sequence similarities. The conventional single-linkage clustering algorithm has been widely used on this problem because it successfully utilizes the transitivity property...
Main Authors: | , |
---|---|
Other Authors: | |
Format: | Others |
Language: | zh-TW |
Published: |
2004
|
Online Access: | http://ndltd.ncl.edu.tw/handle/53565740661331487469 |
id |
ndltd-TW-092NTU05392040 |
---|---|
record_format |
oai_dc |
spelling |
ndltd-TW-092NTU05392040 2016-06-10T04:15:43Z http://ndltd.ncl.edu.tw/handle/53565740661331487469 A Protein Sequence Clustering Algorithm Based on Statistical Models 以統計模型為基礎之複合式蛋白質序列分群演算法 Wen-Chin Chung 鐘文欽 Master's National Taiwan University Graduate Institute of Computer Science and Information Engineering 92
Yen-Jen Oyang 歐陽彥正 2004 thesis 39 zh-TW |
collection |
NDLTD |
language |
zh-TW |
format |
Others
|
sources |
NDLTD |
description |
Master's === National Taiwan University === Graduate Institute of Computer Science and Information Engineering === 92 === Protein sequence clustering can group homologous proteins together based on pairwise sequence similarities. The conventional single-linkage clustering algorithm has been widely used on this problem because it successfully utilizes the transitivity property to identify remote homologues and provides a dendrogram as the clustering result, which is useful for protein family analysis. However, due to the twilight zone embedded in the distribution of pairwise similarities, the single-linkage algorithm sometimes generates clusters with low sensitivity for large families or for families with noisy relationships to the members of other protein families. In this thesis, a hybrid hierarchical clustering algorithm is proposed to improve the quality of the dendrogram generated by the single-linkage clustering algorithm. By creating pair clusters, a single protein can exist in distinct hierarchical paths of a dendrogram. Next, the proposed algorithm employs the skewness and kurtosis indices to control the formation of subclusters, in order to generate highly homologous clusters at the bottom level of a dendrogram. Finally, selecting pivots of a subcluster in the subsequent clustering process avoids the chaining effect that might be caused by the single-linkage algorithm. Thus, the proposed algorithm can produce clusters with both high sensitivity and specificity at the higher levels of a dendrogram. The experimental results in this thesis show that the hierarchy output by the proposed algorithm matches the hierarchy of protein families better than the hierarchy generated by the single-linkage algorithm. In this regard, the generated hierarchy can provide automatic annotations for new proteins with higher accuracy than previous approaches.
|
author2 |
Yen-Jen Oyang |
author_facet |
Yen-Jen Oyang Wen-Chin Chung 鐘文欽 |
author |
Wen-Chin Chung 鐘文欽 |
spellingShingle |
Wen-Chin Chung 鐘文欽 A Protein Sequence Clustering Algorithm Based on Statistical Models |
author_sort |
Wen-Chin Chung |
title |
A Protein Sequence Clustering Algorithm Based on Statistical Models |
title_short |
A Protein Sequence Clustering Algorithm Based on Statistical Models |
title_full |
A Protein Sequence Clustering Algorithm Based on Statistical Models |
title_fullStr |
A Protein Sequence Clustering Algorithm Based on Statistical Models |
title_full_unstemmed |
A Protein Sequence Clustering Algorithm Based on Statistical Models |
title_sort |
protein sequence clustering algorithm based on statistical models |
publishDate |
2004 |
url |
http://ndltd.ncl.edu.tw/handle/53565740661331487469 |
work_keys_str_mv |
AT wenchinchung aproteinsequenceclusteringalgorithmbasedonstatisticalmodels AT zhōngwénqīn aproteinsequenceclusteringalgorithmbasedonstatisticalmodels AT wenchinchung yǐtǒngjìmóxíngwèijīchǔzhīfùhéshìdànbáizhìxùlièfēnqúnyǎnsuànfǎ AT zhōngwénqīn yǐtǒngjìmóxíngwèijīchǔzhīfùhéshìdànbáizhìxùlièfēnqúnyǎnsuànfǎ AT wenchinchung proteinsequenceclusteringalgorithmbasedonstatisticalmodels AT zhōngwénqīn proteinsequenceclusteringalgorithmbasedonstatisticalmodels |
_version_ |
1718299942058983424 |
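The abstract names two ingredients that are easy to illustrate in isolation: threshold-based single-linkage clustering (with its chaining effect) and the skewness and kurtosis of a set of pairwise similarities. The sketch below is not the thesis's algorithm. The similarity matrix `SIM`, the thresholds, and the function names are hypothetical, and the abstract does not say how the two indices are thresholded, so the code only computes them.

```python
from itertools import combinations


def single_linkage(n, similarity, threshold):
    """Link every pair with similarity >= threshold and return the
    connected components (single-linkage clusters at that cut)."""
    parent = list(range(n))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(n), 2):
        if similarity[i][j] >= threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


def skewness_kurtosis(values):
    """Population skewness and excess kurtosis from central moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    if m2 == 0:
        return 0.0, 0.0  # all values identical: treat both indices as zero
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0


def within_cluster_similarities(cluster, similarity):
    """All pairwise similarities between members of one cluster."""
    return [similarity[i][j] for i, j in combinations(cluster, 2)]


# Hypothetical toy data: sequences 0-2 form one family, 3-4 another,
# and a single noisy link (2, 3) at 0.45 sits between the two families.
SIM = [
    [1.0, 0.9, 0.8, 0.1, 0.1],
    [0.9, 1.0, 0.85, 0.1, 0.1],
    [0.8, 0.85, 1.0, 0.45, 0.1],
    [0.1, 0.1, 0.45, 1.0, 0.9],
    [0.1, 0.1, 0.1, 0.9, 1.0],
]

print(single_linkage(5, SIM, 0.5))  # [[0, 1, 2], [3, 4]]
print(single_linkage(5, SIM, 0.4))  # [[0, 1, 2, 3, 4]]: one noisy link chains both families

merged = single_linkage(5, SIM, 0.4)[0]
print(skewness_kurtosis(within_cluster_similarities(merged, SIM)))
```

Lowering the cut from 0.5 to 0.4 merges the two toy families through a single weak link, which is the chaining effect the abstract refers to. The similarities inside the merged cluster then mix very high and very low values, and the shape of that distribution is the kind of signal that skewness and kurtosis summarize.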