On the Study of Using Topic Keyword Clusters for Document Clustering

Doctoral === National Taiwan University of Science and Technology === Department of Information Management === 93 === The rapid development of the Internet has resulted in the increased availability of digital documents. The excessive information available on the Internet has caused information overload. For effective information retrieval, automatic document clustering is a key...


Bibliographic Details
Main Authors: Hsi-Cheng Chang, 張錫正
Other Authors: Chiun-Chieh Hsu
Format: Others
Language: en_US
Published: 2005
Online Access: http://ndltd.ncl.edu.tw/handle/13319360426102780257
id ndltd-TW-093NTUST396051
record_format oai_dc
spelling ndltd-TW-093NTUST3960512015-10-13T12:56:38Z http://ndltd.ncl.edu.tw/handle/13319360426102780257 On the Study of Using Topic Keyword Clusters for Document Clustering 使用主題詞彙群組進行文件分群之研究 Hsi-Cheng Chang 張錫正 Doctoral National Taiwan University of Science and Technology Department of Information Management 93 Chiun-Chieh Hsu 徐俊傑 2005 thesis 89 en_US
collection NDLTD
language en_US
format Others
sources NDLTD
description Doctoral === National Taiwan University of Science and Technology === Department of Information Management === 93 === The rapid development of the Internet has increased the availability of digital documents, and the excess of information available online has caused information overload. For effective information retrieval, automatic document clustering is a key issue in exploratory text data analysis. Clustering is the unsupervised classification of data items into groups, with no need to prepare training data. The dominant conventional document clustering methods are based on measuring the similarity among documents. However, the native feature space comprises the terms in the documents, which may number in the tens or hundreds of thousands for even a moderately sized collection of texts. The numerous features obscure the topics of the documents, and the computational complexity of conventional document clustering methods increases rapidly with the number of documents. Consequently, many conventional methods perform poorly on large document collections, and the clustering results often differ depending on the assumptions and contexts of different document collections. This study aims to reduce the computational complexity of keyword comparison in conventional clustering methods based on document-to-document similarity, while also increasing clustering accuracy. To that end, it reverses the notions and processes of the conventional methods and proposes two unsupervised document clustering methods based on automatic topic keyword clustering. The first method uses weighted directed graphs for keyword clustering to solve the problem of unsupervised non-exclusive document clustering. In this method, the strongly connected components of the keyword digraph are explored heuristically, and the documents are clustered with the identified keyword clusters.
The second method uses weighted undirected graphs for keyword clustering to solve the problem of unsupervised exclusive document clustering. In this method, a topic/event detection scheme is developed and used to extract the most meaningful keywords for document representation and keyword clustering. This process makes the topic keywords more discriminative and concise than simple feature filtering metrics alone and also lowers the computational cost. As a result, the proposed methods use only discriminative and meaningful topic keywords, and the number of topic keywords grows more slowly than the number of documents. Hence, the proposed methods can significantly reduce computational cost while achieving high clustering accuracy relative to the other clustering methods. The proposed methods are therefore better suited to clustering large document collections than conventional document clustering methods.
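The first method described in the abstract (keyword digraph, strongly connected components, non-exclusive document assignment) can be illustrated with a minimal Python sketch. Everything specific here is an assumption, not the thesis's actual procedure: the edge weight P(b | a) = df(a, b) / df(a), the 0.6 threshold, the rule that a document joins every cluster it shares a keyword with, and all function names are hypothetical. The abstract says the components are explored heuristically; this sketch swaps in the exact Kosaraju algorithm instead.

```python
from collections import defaultdict
from itertools import permutations


def keyword_digraph(docs, threshold=0.6):
    """Build a directed keyword graph (assumed weighting, not from the thesis).

    An edge a -> b is kept when P(b | a) = df(a, b) / df(a) >= threshold.
    """
    df = defaultdict(int)   # document frequency of each keyword
    co = defaultdict(int)   # co-occurrence count of each ordered keyword pair
    for words in docs.values():
        for w in words:
            df[w] += 1
        for a, b in permutations(words, 2):
            co[(a, b)] += 1
    graph = {w: set() for w in df}
    for (a, b), n in co.items():
        if n / df[a] >= threshold:
            graph[a].add(b)
    return graph


def strongly_connected_components(graph):
    """Kosaraju's algorithm (recursive; adequate for small graphs)."""
    order, seen = [], set()

    def visit(node):
        seen.add(node)
        for nxt in graph[node]:
            if nxt not in seen:
                visit(nxt)
        order.append(node)  # record finishing order

    for node in graph:
        if node not in seen:
            visit(node)

    # Reverse every edge for the second pass.
    reverse = {node: set() for node in graph}
    for a, targets in graph.items():
        for b in targets:
            reverse[b].add(a)

    components, assigned = [], set()

    def collect(node, comp):
        assigned.add(node)
        comp.add(node)
        for nxt in reverse[node]:
            if nxt not in assigned:
                collect(nxt, comp)

    for node in reversed(order):
        if node not in assigned:
            comp = set()
            collect(node, comp)
            components.append(comp)
    return components


def cluster_documents(docs, threshold=0.6, min_size=2):
    """Map each keyword cluster to the documents sharing a keyword with it."""
    graph = keyword_digraph(docs, threshold)
    clusters = [c for c in strongly_connected_components(graph)
                if len(c) >= min_size]
    return {frozenset(c): sorted(d for d, words in docs.items() if words & c)
            for c in clusters}


# Toy corpus (hypothetical): document id -> set of extracted keywords.
docs = {
    "d1": {"graph", "cluster", "keyword"},
    "d2": {"graph", "cluster"},
    "d3": {"stock", "market"},
    "d4": {"stock", "market", "graph"},
}
result = cluster_documents(docs)
```

In this toy corpus, document d4 shares keywords with both keyword clusters and is therefore assigned to both, which is what non-exclusive clustering means here. The cost of the keyword-side work depends on the number of distinct keywords rather than on document pairs, which is the motivation the abstract gives.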
author2 Chiun-Chieh Hsu
author_facet Chiun-Chieh Hsu
Hsi-Cheng Chang
張錫正
author Hsi-Cheng Chang
張錫正
spellingShingle Hsi-Cheng Chang
張錫正
On the Study of Using Topic Keyword Clusters for Document Clustering
author_sort Hsi-Cheng Chang
title On the Study of Using Topic Keyword Clusters for Document Clustering
title_short On the Study of Using Topic Keyword Clusters for Document Clustering
title_full On the Study of Using Topic Keyword Clusters for Document Clustering
title_fullStr On the Study of Using Topic Keyword Clusters for Document Clustering
title_full_unstemmed On the Study of Using Topic Keyword Clusters for Document Clustering
title_sort on the study of using topic keyword clusters for docuemnt clustering
publishDate 2005
url http://ndltd.ncl.edu.tw/handle/13319360426102780257
work_keys_str_mv AT hsichengchang onthestudyofusingtopickeywordclustersfordocuemntclustering
AT zhāngxīzhèng onthestudyofusingtopickeywordclustersfordocuemntclustering
AT hsichengchang shǐyòngzhǔtícíhuìqúnzǔjìnxíngwénjiànfēnqúnzhīyánjiū
AT zhāngxīzhèng shǐyòngzhǔtícíhuìqúnzǔjìnxíngwénjiànfēnqúnzhīyánjiū
_version_ 1716869945998966784