On the Study of Using Topic Keyword Clusters for Document Clustering

Bibliographic Details
Main Authors: Hsi-Cheng Chang, 張錫正
Other Authors: Chiun-Chieh Hsu
Format: Others
Language: en_US
Published: 2005
Online Access: http://ndltd.ncl.edu.tw/handle/13319360426102780257
Description
Summary: Doctoral === National Taiwan University of Science and Technology === Department of Information Management === 93 === The rapid development of the Internet has increased the availability of digital documents, and the excess of information available online has caused information overload. For effective information retrieval, automatic document clustering is a key issue in exploratory text data analysis. Clustering is the unsupervised classification of data items into groups, without the need to prepare any training data. The dominant conventional document clustering methods are based on measuring the similarity among documents. However, the native feature space comprises the terms in the documents, which may number in the tens or hundreds of thousands even for a moderately sized collection of texts. This large number of features obscures the topics of the documents, and the computational complexity of conventional document clustering methods increases rapidly with the number of documents. Consequently, many conventional methods frequently perform poorly on large document collections, and their results frequently differ depending on the assumptions and contexts of different document collections. This study aims to reduce the computational complexity of keyword comparison in conventional clustering methods based on document-to-document similarity, while also increasing clustering accuracy. To that end, reversing the notions and processes of conventional document clustering, this study proposes two unsupervised document clustering methods based on automatic topic keyword clustering. The first method uses weighted directed graphs for keyword clustering to solve the problem of unsupervised non-exclusive document clustering. In this method, the strongly connected components of the keyword digraph are explored heuristically, and the documents are clustered using the identified keyword clusters.
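The idea behind the first method can be illustrated with a minimal sketch. The record does not give the thesis's edge weighting or its heuristic exploration, so everything specific here is an assumption: the edge weight from keyword a to keyword b is taken to be the conditional co-occurrence df(a, b) / df(a), edges below a hypothetical `min_weight` threshold are dropped, and Tarjan's standard algorithm stands in for the heuristic search for strongly connected components. All function names are illustrative.

```python
from collections import defaultdict
from itertools import permutations


def build_keyword_digraph(docs, min_weight=0.5):
    """Keep edge a -> b when df(a, b) / df(a) >= min_weight (assumed weighting)."""
    df = defaultdict(int)   # number of documents containing each keyword
    co = defaultdict(int)   # number of documents containing both keywords
    for words in docs.values():
        for w in words:
            df[w] += 1
        for a, b in permutations(words, 2):
            co[(a, b)] += 1
    graph = {w: set() for w in df}
    for (a, b), n in co.items():
        if n / df[a] >= min_weight:
            graph[a].add(b)
    return graph


def strongly_connected_components(graph):
    """Tarjan's algorithm (recursive, so only suitable for small graphs)."""
    index, low, on_stack = {}, {}, set()
    stack, components = [], []

    def visit(v):
        index[v] = low[v] = len(index)
        stack.append(v)
        on_stack.add(v)
        for w in graph[v]:
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:      # v is the root of a component
            comp = set()
            while True:
                w = stack.pop()
                on_stack.discard(w)
                comp.add(w)
                if w == v:
                    break
            components.append(comp)

    for v in graph:
        if v not in index:
            visit(v)
    return components


def cluster_documents(docs, keyword_clusters, min_overlap=1):
    """Non-exclusive: a document joins every keyword cluster it overlaps."""
    clusters = defaultdict(set)
    for doc_id, words in docs.items():
        for i, kc in enumerate(keyword_clusters):
            if len(words & kc) >= min_overlap:
                clusters[i].add(doc_id)
    return dict(clusters)


# Toy collection: document id -> set of topic keywords (made-up data).
docs = {
    "d1": {"graph", "cluster"},
    "d2": {"graph", "cluster"},
    "d3": {"stock", "market"},
    "d4": {"stock", "market", "graph"},
}
graph = build_keyword_digraph(docs, min_weight=0.6)
keyword_clusters = [c for c in strongly_connected_components(graph) if len(c) > 1]
doc_clusters = cluster_documents(docs, keyword_clusters)
```

In this toy data the keywords form two clusters, and d4 lands in both of the resulting document clusters because it shares keywords with each, which is what non-exclusive clustering permits.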
The second method uses weighted undirected graphs for keyword clustering to solve the problem of unsupervised exclusive document clustering. In this method, a topic/event detection scheme is developed and used to extract the most meaningful keywords for document representation and keyword clustering. This process makes the topic keywords more discriminative and concise than simple feature filtering metrics alone would, and it also lightens the computational cost. As a result, the proposed methods use only discriminative and meaningful topic keywords, and the number of topic keywords grows more slowly than the number of documents. Hence, the proposed methods can significantly reduce computational costs while achieving higher clustering accuracy than the other clustering methods. The proposed methods are therefore more suitable than conventional document clustering methods for clustering large document collections.
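A comparable sketch of the second method follows, again under stated assumptions rather than as the thesis's procedure. The topic/event detection scheme is not described in this record, so the keywords are assumed to be already extracted. The edge weight between two keywords is assumed to be the Jaccard similarity of the document sets containing them, keyword clusters are taken as connected components of the thresholded graph (found with union-find), and each document is assigned to the single cluster with which it shares the most keywords. The names `keyword_clusters_undirected`, `assign_exclusive`, and `min_weight` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def keyword_clusters_undirected(docs, min_weight=0.5):
    """Connected components of the keyword graph after thresholding
    Jaccard-weighted edges (assumed weighting)."""
    postings = defaultdict(set)     # keyword -> ids of documents containing it
    for doc_id, words in docs.items():
        for w in words:
            postings[w].add(doc_id)

    parent = {w: w for w in postings}

    def find(w):
        # Union-find lookup with path halving.
        while parent[w] != w:
            parent[w] = parent[parent[w]]
            w = parent[w]
        return w

    for a, b in combinations(sorted(postings), 2):
        shared = len(postings[a] & postings[b])
        total = len(postings[a] | postings[b])
        if shared / total >= min_weight:
            parent[find(a)] = find(b)

    groups = defaultdict(set)
    for w in postings:
        groups[find(w)].add(w)
    return list(groups.values())


def assign_exclusive(docs, clusters):
    """Exclusive: each document goes to the one cluster with the largest
    keyword overlap; ties go to the lowest cluster index."""
    assignment = {}
    for doc_id, words in sorted(docs.items()):
        best = max(range(len(clusters)),
                   key=lambda i: (len(words & clusters[i]), -i))
        assignment[doc_id] = best
    return assignment


# Toy collection: document id -> set of topic keywords (made-up data).
docs = {
    "d1": {"graph", "cluster"},
    "d2": {"graph", "cluster"},
    "d3": {"stock", "market"},
    "d4": {"stock", "market", "graph"},
}
clusters = keyword_clusters_undirected(docs, min_weight=0.5)
assignment = assign_exclusive(docs, clusters)
```

Here d4 is placed only with d3, since it shares two keywords with the stock/market cluster and one with the other. The pairwise loop runs over keywords rather than documents, which is why a keyword set that grows more slowly than the collection keeps the comparison cost down.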