Summary: | Master's thesis === Ming Chuan University === Graduate Institute of Information Management === 91 === Text categorization is the process of assigning text documents to one or more predefined categories or classes of similar documents. Two kinds of document representation have been used in the literature: word-based and class-based. The most common word-based approaches, built on the vector space model, rely on feature selection over large text collections. The crux of the word-based problem is the high-dimensional space and the sparse data distribution. In recent years, attention has shifted from word-based to class-based representation, in the hope of providing broader coverage for unrestricted text. In this thesis, we use bisecting K-means to cluster terms. To improve the quality of the bisecting K-means result, we apply the K-means algorithm to refine each cluster. As a result, some small clusters can be reassigned to other related clusters. Our test collection consists of a large number of Chinese newswire articles, and a support vector machine was used to classify the test data. Our experimental results show that the accuracy obtained with the bisecting K-means algorithm exceeded that of hierarchical clustering by about 10%. To control the number of clusters, we used a dissimilarity analysis to find the appropriate number of clusters, which sets the dimensionality of each document's representation. With this tuning approach, a suitable number of clusters can be obtained automatically.
|
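The clustering step described in the summary (bisecting K-means followed by a K-means refinement pass) can be illustrated with a minimal sketch. It is not the thesis implementation. The summary does not say how terms are represented, which cluster is split next, or how each split is seeded. Here terms are assumed to be numeric vectors compared by squared Euclidean distance, the largest cluster is split at each step, and each split is seeded with two far-apart points. All function names are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def kmeans(points, centroids, max_iter=50):
    """Plain K-means (Lloyd's algorithm) started from the given centroids."""
    centroids = [list(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assign every point to its nearest centroid.
        labels = [min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
                  for p in points]
        # Recompute centroids; an empty cluster keeps its old centroid.
        updated = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            updated.append(mean(members) if members else old)
        if updated == centroids:
            break
        centroids = updated
    return labels, centroids


def far_apart_pair(points):
    """Assumed seeding: the point farthest from the mean, then the point farthest from it."""
    center = mean(points)
    first = max(points, key=lambda p: sq_dist(p, center))
    second = max(points, key=lambda p: sq_dist(p, first))
    return [first, second]


def bisecting_kmeans_refined(points, k):
    """Split clusters in two until k exist, then refine with K-means over all points."""
    clusters = [list(points)]
    while len(clusters) < k:
        target = max(clusters, key=len)  # assumption: always split the largest cluster
        if len(target) < 2:
            break
        labels, _ = kmeans(target, far_apart_pair(target))
        left = [p for p, lab in zip(target, labels) if lab == 0]
        right = [p for p, lab in zip(target, labels) if lab == 1]
        if not left or not right:  # degenerate split, e.g. identical points
            break
        clusters.remove(target)
        clusters += [left, right]
    # Refinement: K-means over all points, started from the bisecting centroids,
    # so points in poorly placed clusters can move to a closer cluster.
    return kmeans(points, [mean(c) for c in clusters])
```

Calling `bisecting_kmeans_refined(vectors, k)` returns one cluster label per input vector together with the final centroids. In a class-based representation, each document could then be described by counts over the term clusters rather than over individual words, which is where the reduction in dimensionality comes from.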