On the Study of Using Topic Keyword Clusters for Document Clustering

Doctoral === National Taiwan University of Science and Technology === Department of Information Management === 93 === The rapid development of the Internet has resulted in the increased availability of digital documents. The excessive information available on the Internet has caused information overflow. For effective information retrieval, automatic document clustering is a key...

Bibliographic Details
Main Authors:	Hsi-Cheng Chang, 張錫正
Other Authors:	Chiun-Chieh Hsu
Format:	Others
Language:	en_US
Published:	2005
Online Access:	http://ndltd.ncl.edu.tw/handle/13319360426102780257

id	ndltd-TW-093NTUST396051
record_format	oai_dc
spelling	ndltd-TW-093NTUST396051 2015-10-13T12:56:38Z http://ndltd.ncl.edu.tw/handle/13319360426102780257 On the Study of Using Topic Keyword Clusters for Document Clustering 使用主題詞彙群組進行文件分群之研究 Hsi-Cheng Chang 張錫正 Doctoral, National Taiwan University of Science and Technology, Department of Information Management, 93 Chiun-Chieh Hsu 徐俊傑 2005 degree thesis ; thesis 89 en_US
collection	NDLTD
language	en_US
format	Others
sources	NDLTD
description	Doctoral === National Taiwan University of Science and Technology === Department of Information Management === 93 === The rapid development of the Internet has increased the availability of digital documents, and the excess of information available online has caused information overload. For effective information retrieval, automatic document clustering is a key issue in exploratory text data analysis. Clustering is the unsupervised classification of data items into groups, without the need to prepare any training data. The dominant conventional approaches to document clustering are based on measuring the similarity among documents. However, the native feature space comprises the terms in the documents, which may number tens or hundreds of thousands even for a moderately sized collection of texts. This large number of features obscures the topics of the documents, and the computational complexity of conventional document clustering methods increases rapidly with the number of documents. Consequently, many conventional methods perform poorly on large document collections, and their clustering results often differ depending on the assumptions and contexts of different document collections. This study aims to reduce the computational complexity of keyword comparison in conventional clustering methods based on document-to-document similarity, while also increasing clustering accuracy. To that end, it reverses the notions and processes of the conventional methods and proposes two unsupervised document clustering methods based on automatic topic keyword clustering. The first method uses weighted directed graphs for keyword clustering to solve the problem of unsupervised non-exclusive document clustering. In this method, the strongly connected components of the keyword digraph are explored heuristically, and the documents are clustered using the identified keyword clusters.
The second method uses weighted undirected graphs for keyword clustering to solve the unsupervised exclusive document clustering problem. In this method, a topic/event detection scheme is developed and used to extract the most meaningful keywords for document representation and keyword clustering. This process makes the topic keywords more discriminative and concise than simple feature filtering metrics alone and, furthermore, lowers the computational cost. As a result, the proposed methods use only discriminative and meaningful topic keywords, and the number of topic keywords grows more slowly than the number of documents. Hence, the proposed methods can significantly reduce computational cost while achieving higher clustering accuracy than the other clustering methods compared. The proposed methods are more suitable for clustering large document collections than conventional document clustering methods.
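The first method described in the abstract (keyword clustering on a weighted directed graph, followed by non-exclusive document assignment) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the thesis: the edge weight (the fraction of documents containing keyword a that also contain keyword b), the `min_weight` threshold, the `min_overlap` assignment rule, and all function names are hypothetical. The abstract says the strongly connected components are explored heuristically; this sketch substitutes the exact Tarjan algorithm.

```python
from collections import defaultdict
from itertools import permutations


def build_keyword_digraph(docs, min_weight=0.5):
    """Build a directed keyword graph from {doc_id: set_of_keywords}.

    Assumed weighting (not from the thesis): edge a -> b has weight
    (docs containing both a and b) / (docs containing a); only edges
    with weight >= min_weight are kept.
    """
    doc_freq = defaultdict(int)   # number of documents containing each keyword
    co_freq = defaultdict(int)    # number of documents containing an ordered pair
    for keywords in docs.values():
        for k in keywords:
            doc_freq[k] += 1
        for a, b in permutations(keywords, 2):
            co_freq[(a, b)] += 1
    graph = {k: set() for k in doc_freq}
    for (a, b), n in co_freq.items():
        if n / doc_freq[a] >= min_weight:
            graph[a].add(b)
    return graph


def strongly_connected_components(graph):
    """Tarjan's algorithm (recursive, so suited to small graphs only)."""
    index, low = {}, {}
    stack, on_stack = [], set()
    components = []

    def visit(v):
        index[v] = low[v] = len(index)
        stack.append(v)
        on_stack.add(v)
        for w in graph[v]:
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:    # v is the root of a component
            component = set()
            while True:
                w = stack.pop()
                on_stack.discard(w)
                component.add(w)
                if w == v:
                    break
            components.append(component)

    for v in graph:
        if v not in index:
            visit(v)
    return components


def assign_documents(docs, clusters, min_overlap=1):
    """Non-exclusive assignment: a document joins every keyword cluster
    with which it shares at least min_overlap keywords (assumed rule)."""
    return {
        doc_id: [i for i, c in enumerate(clusters) if len(keywords & c) >= min_overlap]
        for doc_id, keywords in docs.items()
    }


# Toy data, invented for illustration only.
sample_docs = {
    "d1": {"graph", "vertex", "edge"},
    "d2": {"graph", "edge"},
    "d3": {"neuron", "synapse"},
    "d4": {"neuron", "synapse", "graph"},
}
digraph = build_keyword_digraph(sample_docs, min_weight=0.5)
keyword_clusters = [c for c in strongly_connected_components(digraph) if len(c) >= 2]
membership = assign_documents(sample_docs, keyword_clusters)
```

In this toy collection the keyword comparison runs over five keywords rather than over all document pairs, and document d4 falls into both keyword clusters, which is what non-exclusive clustering permits.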
author2	Chiun-Chieh Hsu
author_facet	Chiun-Chieh Hsu Hsi-Cheng Chang 張錫正
author	Hsi-Cheng Chang 張錫正
spellingShingle	Hsi-Cheng Chang 張錫正 On the Study of Using Topic Keyword Clusters for Document Clustering
author_sort	Hsi-Cheng Chang
title	On the Study of Using Topic Keyword Clusters for Document Clustering
title_short	On the Study of Using Topic Keyword Clusters for Document Clustering
title_full	On the Study of Using Topic Keyword Clusters for Document Clustering
title_fullStr	On the Study of Using Topic Keyword Clusters for Document Clustering
title_full_unstemmed	On the Study of Using Topic Keyword Clusters for Document Clustering
title_sort	on the study of using topic keyword clusters for document clustering
publishDate	2005
url	http://ndltd.ncl.edu.tw/handle/13319360426102780257
work_keys_str_mv	AT hsichengchang onthestudyofusingtopickeywordclustersfordocuemntclustering AT zhāngxīzhèng onthestudyofusingtopickeywordclustersfordocuemntclustering AT hsichengchang shǐyòngzhǔtícíhuìqúnzǔjìnxíngwénjiànfēnqúnzhīyánjiū AT zhāngxīzhèng shǐyòngzhǔtícíhuìqúnzǔjìnxíngwénjiànfēnqúnzhīyánjiū
_version_	1716869945998966784

Similar Items