Cluster-based Categorization for Chinese News Articles
Master's === 銘傳大學 (Ming Chuan University) === Graduate Institute of Information Management === 91 === Text categorization is the process of distributing text documents into one or more predefined categories or classes of similar documents. There are two aspects of representing a document. Both word-based and class-based representations of documents have been used in...
Main Author: | 吳毓傑 |
---|---|
Other Authors: | Jennan Chen |
Format: | Others |
Language: | zh-TW |
Published: | 2003 |
Online Access: | http://ndltd.ncl.edu.tw/handle/05058007535814690178 |
id |
ndltd-TW-091MCU00396002 |
---|---|
record_format |
oai_dc |
spelling |
ndltd-TW-091MCU00396002 2015-10-13T17:01:35Z http://ndltd.ncl.edu.tw/handle/05058007535814690178 Cluster-based Categorization for Chinese News Articles 叢聚式中文新聞分類 吳毓傑 Master's 銘傳大學 (Ming Chuan University) Graduate Institute of Information Management 91 Jennan Chen 陳振南 2003 thesis 46 zh-TW |
collection |
NDLTD |
language |
zh-TW |
format |
Others |
sources |
NDLTD |
description |
Master's === 銘傳大學 (Ming Chuan University) === Graduate Institute of Information Management === 91 === Text categorization is the process of distributing text documents into one or more predefined categories or classes of similar documents. There are two aspects of representing a document: both word-based and class-based representations of documents have been used in the literature. However, the most common word-based approaches, which use the vector space model, rely on feature selection over large text collections. The crux of the problem with word-based representation is the high-dimensional space and the sparse data distribution. In recent years, attention has shifted from word-based towards class-based representation, in the hope of providing broader coverage for unrestricted text. In this thesis, we used bisecting K-means to cluster terms. To increase the quality of the bisecting K-means clusters, we employed the K-means algorithm to refine each cluster, so that some small clusters could be assigned to other related clusters. Our test collection consists of a large number of Chinese newswire articles, and a support vector machine was used to classify the test data. Our experimental results demonstrate that bisecting K-means outperformed hierarchical clustering in accuracy by about 10%. To control the number of clusters, we used a dissimilarity analysis to find an appropriate number of clusters, which determines the dimensionality of each document's representation. Based on this tuning approach, a suitable number of clusters can be obtained automatically. |
author2 |
Jennan Chen |
author_facet |
Jennan Chen 吳毓傑 |
author |
吳毓傑 |
spellingShingle |
吳毓傑 Cluster-based Categorization for Chinese News Articles |
author_sort |
吳毓傑 |
title |
Cluster-based Categorization for Chinese News Articles |
title_short |
Cluster-based Categorization for Chinese News Articles |
title_full |
Cluster-based Categorization for Chinese News Articles |
title_fullStr |
Cluster-based Categorization for Chinese News Articles |
title_full_unstemmed |
Cluster-based Categorization for Chinese News Articles |
title_sort |
cluster-based categorization for chinese news articles |
publishDate |
2003 |
url |
http://ndltd.ncl.edu.tw/handle/05058007535814690178 |
work_keys_str_mv |
AT wúyùjié clusterbasedcategorizationforchinesenewsarticles AT wúyùjié cóngjùshìzhōngwénxīnwénfēnlèi |
_version_ |
1717778650683670528 |
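
The term-clustering step described in the abstract can be illustrated with a minimal Python sketch. It assumes each term is represented by its column of a document-term count matrix (the record does not say how terms were vectorized), and it uses a plain NumPy bisecting K-means that always splits the largest cluster. The names `two_means` and `bisecting_kmeans` and the toy matrix `doc_term` are hypothetical. The K-means refinement pass, the dissimilarity-based choice of the number of clusters, and the support vector machine classifier mentioned in the abstract are omitted.

```python
import numpy as np


def two_means(X, rng, n_iter=20):
    """Split the rows of X into two groups with plain K-means (k = 2)."""
    # Start from two distinct rows chosen at random (fancy indexing copies).
    centers = X[rng.choice(len(X), size=2, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Squared Euclidean distance from every row to both centers.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(2):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels


def bisecting_kmeans(X, k, seed=0):
    """Return a cluster index per row of X by repeatedly bisecting the largest cluster."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    clusters = [np.arange(len(X))]
    while len(clusters) < k:
        largest = max(range(len(clusters)), key=lambda c: len(clusters[c]))
        idx = clusters.pop(largest)
        if len(idx) < 2:  # nothing left to split
            clusters.append(idx)
            break
        labels = two_means(X[idx], rng)
        if labels.min() == labels.max():  # degenerate split, e.g. identical rows
            labels = np.arange(len(idx)) % 2
        clusters.extend([idx[labels == 0], idx[labels == 1]])
    assign = np.empty(len(X), dtype=int)
    for c, idx in enumerate(clusters):
        assign[idx] = c
    return assign


# Toy data (assumption): 4 documents x 6 terms, raw term counts.
doc_term = np.array([
    [3, 2, 4, 0, 0, 0],
    [2, 3, 3, 0, 0, 1],
    [0, 0, 0, 4, 3, 2],
    [0, 1, 0, 3, 4, 3],
])

k = 2
# Cluster terms: each term vector is a column of doc_term.
assign = bisecting_kmeans(doc_term.T, k)

# Class-based representation: sum the counts of all terms in each cluster.
one_hot = np.eye(k, dtype=int)[assign]   # terms x clusters
doc_cluster = doc_term @ one_hot         # documents x clusters
print(assign)
print(doc_cluster)
```

Each row of `doc_cluster` has one entry per term cluster instead of one per term, which is the lower-dimensional class-based representation that the abstract contrasts with word-based vectors.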