Cluster-based Categorization for Chinese News Articles

Master's === Ming Chuan University === Graduate Institute of Information Management === 91 === Text categorization is the process of assigning text documents to one or more predefined categories or classes of similar documents. A document can be represented in two ways, and both word-based and class-based representations have been used in...

Bibliographic Details
Main Author:	吳毓傑
Other Authors:	Jennan Chen
Format:	Others
Language:	zh-TW
Published:	2003
Online Access:	http://ndltd.ncl.edu.tw/handle/05058007535814690178

id	ndltd-TW-091MCU00396002
record_format	oai_dc
spelling	ndltd-TW-091MCU00396002 2015-10-13T17:01:35Z http://ndltd.ncl.edu.tw/handle/05058007535814690178 Cluster-based Categorization for Chinese News Articles 叢聚式中文新聞分類 吳毓傑 Master's, Ming Chuan University, Graduate Institute of Information Management, 91. Text categorization is the process of assigning text documents to one or more predefined categories or classes of similar documents. A document can be represented in two ways, and both word-based and class-based representations have been used in the literature. However, the most common word-based approaches, which use the vector space model, rely on feature selection over large text collections. The crux of the word-based problem is the high-dimensional space and the sparse data distribution. In recent years, attention has shifted from word-based towards class-based representation, in the hope of providing broader coverage for unrestricted text. In this thesis, we used bisecting K-means to cluster terms. To improve the quality of the bisecting K-means clusters, we employed the K-means algorithm to refine each cluster, so that some small clusters could be assigned to other related clusters. Our test collection consists of a large number of Chinese newswire articles, and a support vector machine was used to classify the test data. Our experimental results demonstrate that the bisecting K-means algorithm outperformed hierarchical clustering in accuracy by about 10%. To control the number of clusters, we used a dissimilarity analysis to find the appropriate number of clusters, which sets the dimensionality of each document's representation. With this tuning approach, a suitable number of clusters can be obtained automatically. Jennan Chen 陳振南 2003 學位論文 ; thesis 46 zh-TW
collection	NDLTD
language	zh-TW
format	Others
sources	NDLTD
description	Master's === Ming Chuan University === Graduate Institute of Information Management === 91 === Text categorization is the process of assigning text documents to one or more predefined categories or classes of similar documents. A document can be represented in two ways, and both word-based and class-based representations have been used in the literature. However, the most common word-based approaches, which use the vector space model, rely on feature selection over large text collections. The crux of the word-based problem is the high-dimensional space and the sparse data distribution. In recent years, attention has shifted from word-based towards class-based representation, in the hope of providing broader coverage for unrestricted text. In this thesis, we used bisecting K-means to cluster terms. To improve the quality of the bisecting K-means clusters, we employed the K-means algorithm to refine each cluster, so that some small clusters could be assigned to other related clusters. Our test collection consists of a large number of Chinese newswire articles, and a support vector machine was used to classify the test data. Our experimental results demonstrate that the bisecting K-means algorithm outperformed hierarchical clustering in accuracy by about 10%. To control the number of clusters, we used a dissimilarity analysis to find the appropriate number of clusters, which sets the dimensionality of each document's representation. With this tuning approach, a suitable number of clusters can be obtained automatically.
author2	Jennan Chen
author_facet	Jennan Chen 吳毓傑
author	吳毓傑
spellingShingle	吳毓傑 Cluster-based Categorization for Chinese News Articles
author_sort	吳毓傑
title	Cluster-based Categorization for Chinese News Articles
title_short	Cluster-based Categorization for Chinese News Articles
title_full	Cluster-based Categorization for Chinese News Articles
title_fullStr	Cluster-based Categorization for Chinese News Articles
title_full_unstemmed	Cluster-based Categorization for Chinese News Articles
title_sort	cluster-based categorization for chinese news articles
publishDate	2003
url	http://ndltd.ncl.edu.tw/handle/05058007535814690178
work_keys_str_mv	AT wúyùjié clusterbasedcategorizationforchinesenewsarticles AT wúyùjié cóngjùshìzhōngwénxīnwénfēnlèi
_version_	1717778650683670528
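
Illustrative sketch (not part of the catalog record or the thesis): the abstract describes clustering terms with bisecting K-means and then refining the clusters with K-means. The Python below is a minimal sketch of that idea on toy 2-D points. The Euclidean distance, the rule of splitting the cluster with the largest squared error, the farthest-point seeding, and all function names are assumptions chosen for illustration; the abstract does not state how the thesis made these choices, and the support vector machine step is not shown.

```python
def _dist2(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def _mean(group):
    """Component-wise mean (centroid) of a non-empty list of tuples."""
    return tuple(sum(col) / len(group) for col in zip(*group))


def _sse(group):
    """Sum of squared distances from each point to the group centroid."""
    centre = _mean(group)
    return sum(_dist2(p, centre) for p in group)


def kmeans(points, centroids, max_iter=50):
    """Plain K-means (Lloyd's algorithm) started from the given centroids."""
    centroids = list(centroids)
    groups = [list(points)]
    for _ in range(max_iter):
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: _dist2(p, centroids[i]))
            groups[nearest].append(p)
        groups = [g for g in groups if g]  # drop clusters that became empty
        updated = [_mean(g) for g in groups]
        if updated == centroids:  # assignments no longer change
            break
        centroids = updated
    return groups


def bisecting_kmeans(points, k):
    """Split one cluster in two until k clusters exist, then refine them."""
    clusters = [list(points)]
    for _ in range(k - 1):
        # Assumption: split the cluster with the largest squared error.
        splittable = [c for c in clusters if len(set(c)) > 1]
        if not splittable:
            break
        target = max(splittable, key=_sse)
        # Assumption: deterministic seeding with two far-apart points.
        centre = _mean(target)
        a = max(target, key=lambda p: _dist2(p, centre))
        b = max(target, key=lambda p: _dist2(p, a))
        clusters.remove(target)
        clusters.extend(kmeans(target, [a, b]))
    # Refinement: one K-means run over all points, seeded with the centroids
    # found by bisecting, so points can move to a more closely related cluster.
    return kmeans(list(points), [_mean(c) for c in clusters])


# Toy data: three well-separated groups of 2-D points (hypothetical values).
points = [(0, 0), (0, 1), (1, 0),
          (10, 10), (10, 11), (11, 10),
          (20, 0), (20, 1), (21, 0)]
clusters = bisecting_kmeans(points, 3)
```

In a setting like the one the abstract describes, the points would be term vectors rather than 2-D coordinates, and each document would then be represented by its terms' cluster memberships before classification.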

Similar Items