Cluster-based Categorization for Chinese News Articles

Master's === Ming Chuan University === Graduate Institute of Information Management === 91 === Text categorization is the process of assigning text documents to one or more predefined categories or classes of similar documents. A document can be represented in two ways: both word-based and class-based representations of documents have been used in...

Bibliographic Details
Main Author: 吳毓傑
Other Authors: Jennan Chen
Format: Others
Language: zh-TW
Published: 2003
Online Access: http://ndltd.ncl.edu.tw/handle/05058007535814690178
id ndltd-TW-091MCU00396002
record_format oai_dc
spelling ndltd-TW-091MCU00396002 2015-10-13T17:01:35Z http://ndltd.ncl.edu.tw/handle/05058007535814690178 Cluster-based Categorization for Chinese News Articles 叢聚式中文新聞分類 吳毓傑 Master's Ming Chuan University Graduate Institute of Information Management 91 Jennan Chen 陳振南 2003 thesis 46 zh-TW
collection NDLTD
language zh-TW
format Others
sources NDLTD
description Master's === Ming Chuan University === Graduate Institute of Information Management === 91 === Text categorization is the process of assigning text documents to one or more predefined categories or classes of similar documents. A document can be represented in two ways: both word-based and class-based representations have been used in the literature. However, the most common word-based approaches, which use the vector space model, rely on feature selection over large text collections. The crux of the word-based problem is the high-dimensional space and the sparse data distribution. In recent years, attention has shifted from word-based towards class-based representation, in the hope of providing broader coverage for unrestricted text. In this thesis, we used bisecting K-means to cluster terms. To improve the quality of the bisecting K-means result, we employed the K-means algorithm to refine each cluster, so that some small clusters could be assigned to other related clusters. Our test collection consists of a large number of Chinese newswire articles, and a support vector machine was used to classify the test data. Our experimental results demonstrate that the accuracy obtained with the bisecting K-means algorithm was about 10% higher than that obtained with hierarchical clustering. To control the number of clusters, we used a dissimilarity analysis to find the appropriate number of clusters, which determines the dimensionality of each document's representation. With this tuning approach, a suitable number of clusters can be obtained automatically.
author2 Jennan Chen
author_facet Jennan Chen
吳毓傑
author 吳毓傑
title Cluster-based Categorization for Chinese News Articles
publishDate 2003
url http://ndltd.ncl.edu.tw/handle/05058007535814690178
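
The pipeline summarized in the description (cluster terms with bisecting K-means, refine with K-means, then represent each document by term clusters instead of individual words) can be illustrated with a minimal sketch. This is not the thesis's implementation: the toy English terms, the three-dimensional term vectors, the choice to always split the largest cluster, the farthest-point seeding, and the single global K-means pass used as the refinement step are all assumptions made for illustration. The support vector machine stage and the dissimilarity analysis for choosing the number of clusters are omitted.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def kmeans(points, centers, iters=20):
    """Plain K-means from the given initial centers; returns (labels, centers)."""
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(len(centers)), key=lambda c: sq_dist(p, centers[c]))
                  for p in points]
        new_centers = []
        for c in range(len(centers)):
            members = [p for p, l in zip(points, labels) if l == c]
            # An empty cluster keeps its previous center.
            new_centers.append(centroid(members) if members else centers[c])
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers

def bisecting_kmeans(points, k):
    """Split the largest cluster in two with 2-means until there are k clusters."""
    clusters = [list(range(len(points)))]  # each cluster is a list of point indices
    while len(clusters) < k:
        clusters.sort(key=len)
        biggest = clusters.pop()
        pts = [points[i] for i in biggest]
        # Assumed deterministic seeding: two mutually distant points.
        a = max(pts, key=lambda p: sq_dist(p, pts[0]))
        b = max(pts, key=lambda p: sq_dist(p, a))
        labels, _ = kmeans(pts, [a, b])
        left = [i for i, l in zip(biggest, labels) if l == 0]
        right = [i for i, l in zip(biggest, labels) if l == 1]
        if not left or not right:  # cluster cannot be split further
            clusters.append(biggest)
            break
        clusters += [left, right]
    return clusters

def refine(points, clusters):
    """Assumed refinement: one global K-means run started from the bisecting centroids."""
    centers = [centroid([points[i] for i in c]) for c in clusters]
    labels, _ = kmeans(points, centers)
    return labels

def cluster_features(doc_terms, term_to_cluster, k):
    """Class-based representation: count a document's terms per term cluster."""
    vec = [0] * k
    for t in doc_terms:
        if t in term_to_cluster:
            vec[term_to_cluster[t]] += 1
    return vec

# Hypothetical toy data: each term is described by its counts in three topics.
terms = ["stock", "bank", "market", "goal", "team", "coach", "rain", "storm"]
vectors = [[5, 0, 0], [4, 1, 0], [5, 1, 0], [0, 5, 0],
           [1, 4, 0], [0, 5, 1], [0, 0, 5], [0, 1, 4]]

k = 3
labels = refine(vectors, bisecting_kmeans(vectors, k))
term_to_cluster = dict(zip(terms, labels))
features = cluster_features(["stock", "market", "rain", "unknown"], term_to_cluster, k)
print(term_to_cluster)
print(features)
```

In this sketch a document is reduced from one dimension per word to one dimension per term cluster, which is the dimensionality reduction the description attributes to the class-based representation; the resulting vectors would then be passed to a classifier.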