On the Study of Using Topic Keyword Clusters for Document Clustering
Doctoral thesis === National Taiwan University of Science and Technology === Department of Information Management === 93 === The rapid development of the Internet has resulted in the increased availability of digital documents. The excessive information available on the Internet has caused information overload. For effective information retrieval, automatic document clustering is a key...
Main Author: | Hsi-Cheng Chang 張錫正 |
---|---|
Other Authors: | Chiun-Chieh Hsu 徐俊傑 |
Format: | Others |
Language: | en_US |
Published: | 2005 |
Online Access: | http://ndltd.ncl.edu.tw/handle/13319360426102780257 |
id |
ndltd-TW-093NTUST396051 |
---|---|
record_format |
oai_dc |
spelling |
ndltd-TW-093NTUST396051 2015-10-13T12:56:38Z http://ndltd.ncl.edu.tw/handle/13319360426102780257 On the Study of Using Topic Keyword Clusters for Document Clustering 使用主題詞彙群組進行文件分群之研究 Hsi-Cheng Chang 張錫正 Doctoral thesis, National Taiwan University of Science and Technology, Department of Information Management, 93 Chiun-Chieh Hsu 徐俊傑 2005 degree thesis ; thesis 89 en_US |
collection |
NDLTD |
language |
en_US |
format |
Others
|
sources |
NDLTD |
description |
Doctoral thesis === National Taiwan University of Science and Technology === Department of Information Management === 93 === The rapid development of the Internet has greatly increased the availability of digital documents, and the sheer volume of information online has led to information overload. For effective information retrieval, automatic document clustering is a key issue in exploratory text data analysis. Clustering is the unsupervised classification of data items into groups, with no need to prepare training data. The dominant conventional approaches to document clustering are based on measuring the similarity among documents. However, the native feature space consists of the terms in the documents, which may number in the tens or hundreds of thousands even for a moderately sized collection of texts. So many features obscure the topics of the documents, and the computational complexity of conventional document clustering methods increases rapidly with the number of documents. Consequently, many conventional methods perform poorly on large document collections, and their results often vary with the assumptions made and the context of each collection.
This study aims to reduce the computational complexity of keyword comparison in conventional clustering methods based on document-to-document similarity, while also increasing clustering accuracy. Reversing the notions and processes of the conventional methods, it proposes two unsupervised document clustering methods based on automatic topic keyword clustering. The first method uses weighted directed graphs for keyword clustering to solve the problem of unsupervised non-exclusive document clustering: the strongly connected components in the keyword digraph are explored heuristically, and the documents are clustered with the identified keyword clusters. The second method uses weighted undirected graphs for keyword clustering to solve the unsupervised exclusive document clustering problem: a topic/event detection scheme is developed and used to extract the most meaningful keywords for document representation and keyword clustering. This process makes the topic keywords more discriminative and concise than simple feature filtering metrics alone and also lowers the computational cost. As a result, the proposed methods use only discriminative and meaningful topic keywords, and the number of topic keywords grows more slowly than the number of documents. The proposed methods can therefore significantly reduce computational cost while achieving high clustering accuracy compared with other clustering methods, which makes them more suitable than conventional methods for clustering large document collections.
|
author2 |
Chiun-Chieh Hsu |
author_facet |
Chiun-Chieh Hsu Hsi-Cheng Chang 張錫正 |
author |
Hsi-Cheng Chang 張錫正 |
spellingShingle |
Hsi-Cheng Chang 張錫正 On the Study of Using Topic Keyword Clusters for Document Clustering |
author_sort |
Hsi-Cheng Chang |
title |
On the Study of Using Topic Keyword Clusters for Document Clustering |
title_sort |
on the study of using topic keyword clusters for document clustering |
publishDate |
2005 |
url |
http://ndltd.ncl.edu.tw/handle/13319360426102780257 |
work_keys_str_mv |
AT hsichengchang onthestudyofusingtopickeywordclustersfordocuemntclustering AT zhāngxīzhèng onthestudyofusingtopickeywordclustersfordocuemntclustering AT hsichengchang shǐyòngzhǔtícíhuìqúnzǔjìnxíngwénjiànfēnqúnzhīyánjiū AT zhāngxīzhèng shǐyòngzhǔtícíhuìqúnzǔjìnxíngwénjiànfēnqúnzhīyánjiū |
_version_ |
1716869945998966784 |
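
The abstract's first method (keyword clustering on a weighted directed graph, followed by non-exclusive document clustering) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the thesis: the edge weight (the share of documents containing keyword a that also contain keyword b), the `threshold` value, the toy `docs` data, and all function names. The thesis explores strongly connected components heuristically; this sketch swaps in the standard exact Kosaraju algorithm instead. The `exclusive` flag is included only to contrast the two assignment styles and does not reproduce the thesis's second, undirected-graph method.

```python
from collections import defaultdict
from itertools import permutations


def keyword_digraph(docs, threshold=0.6):
    """Add edge a -> b when the share of documents containing a
    that also contain b is at least `threshold` (assumed weighting)."""
    df = defaultdict(int)   # document frequency per keyword
    co = defaultdict(int)   # co-occurrence count per ordered keyword pair
    for words in docs.values():
        for w in words:
            df[w] += 1
        for a, b in permutations(words, 2):
            co[(a, b)] += 1
    graph = {w: set() for w in df}
    for (a, b), n in co.items():
        if n / df[a] >= threshold:
            graph[a].add(b)
    return graph


def strongly_connected_components(graph):
    """Kosaraju's algorithm: finish-order DFS, then DFS on the reversed graph."""
    order, seen = [], set()

    def visit(node):
        seen.add(node)
        for nxt in graph[node]:
            if nxt not in seen:
                visit(nxt)
        order.append(node)

    for node in graph:
        if node not in seen:
            visit(node)

    reverse = {node: set() for node in graph}
    for node, targets in graph.items():
        for t in targets:
            reverse[t].add(node)

    components, assigned = [], set()
    for node in reversed(order):
        if node in assigned:
            continue
        stack, comp = [node], set()
        assigned.add(node)
        while stack:
            cur = stack.pop()
            comp.add(cur)
            for prev in reverse[cur]:
                if prev not in assigned:
                    assigned.add(prev)
                    stack.append(prev)
        components.append(comp)
    return components


def assign_documents(docs, clusters, exclusive=False):
    """Map each document to every keyword cluster it overlaps (non-exclusive),
    or only to the cluster with the largest overlap (exclusive)."""
    result = {}
    for doc_id, words in docs.items():
        hits = [(len(words & c), i) for i, c in enumerate(clusters) if words & c]
        if exclusive and hits:
            # Ties are broken in favour of the lower cluster index.
            best = max(hits, key=lambda h: (h[0], -h[1]))
            result[doc_id] = [best[1]]
        else:
            result[doc_id] = sorted(i for _, i in hits)
    return result


# Toy collection: document id -> set of topic keywords (hypothetical data).
docs = {
    "d1": {"graph", "cluster", "keyword"},
    "d2": {"graph", "cluster"},
    "d3": {"stock", "market"},
    "d4": {"stock", "market", "graph", "cluster"},
}

graph = keyword_digraph(docs)
# Keep components with at least two keywords; sort them for a stable order.
clusters = sorted(
    (c for c in strongly_connected_components(graph) if len(c) >= 2),
    key=sorted,
)
non_exclusive = assign_documents(docs, clusters)
exclusive = assign_documents(docs, clusters, exclusive=True)
print(clusters)
print(non_exclusive)
print(exclusive)
```

In this toy run, document `d4` falls into both keyword clusters under non-exclusive assignment and into a single cluster under exclusive assignment. Because the comparison is between each document and a small set of keyword clusters rather than between every pair of documents, the sketch also hints at where the cost reduction described in the abstract could come from.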