A Study of Automatic Text Categorization based on Directional Term Structure
Master's === National Chung Hsing University (國立中興大學) === Department of Management Information Systems (資訊管理學系所) === 98 === In mainstream research on automatic text categorization, rule-based classifiers offer an interesting advantage: interpretability. Because the rules in a classifier are composed of terms, they can be easily understood, modified and maintained by h...
Main Authors: | Chia-Chuan Wu, 吳家銓 |
---|---|
Other Authors: | 沈肇基 |
Format: | Others |
Language: | en_US |
Published: | 2010 |
Online Access: | http://ndltd.ncl.edu.tw/handle/25870222133144392967 |
id |
ndltd-TW-098NCHU5396002 |
---|---|
record_format |
oai_dc |
spelling |
ndltd-TW-098NCHU5396002 2015-10-30T04:05:02Z http://ndltd.ncl.edu.tw/handle/25870222133144392967 A Study of Automatic Text Categorization based on Directional Term Structure 運用有向詞彙結構於文件自動分類之研究 Chia-Chuan Wu 吳家銓 碩士 國立中興大學 資訊管理學系所 98 沈肇基 2010 學位論文 ; thesis 87 en_US |
collection |
NDLTD |
language |
en_US |
format |
Others |
sources |
NDLTD |
description |
Master's === National Chung Hsing University (國立中興大學) === Department of Management Information Systems (資訊管理學系所) === 98 === In mainstream research on automatic text categorization, rule-based classifiers offer an interesting advantage: interpretability. Because the rules in a classifier are composed of terms, they can be easily understood, modified and maintained by humans. Their accuracy is also competitive with that of other accurate classifiers such as SVM and Bayes nets, so rule-based classification techniques are very popular. Current rule-based techniques identify valuable patterns in training documents to construct classification rules, but they do not consider the relationship between terms and paragraphs within a document. Because there must be a dominant topic running through a document's content, certain term structures arise across its paragraphs. This study therefore presents a new concept, the Meaningful Inner Link Object (MILO), which finds the underlying directional term links across the paragraphs of a document for text categorization.
In this study, the MILO process for text categorization consists of four main procedures. First, feature selection: the purpose is to find representative terms with which to compose MILOs from training documents that contain a large number of noise terms. Second, MILO filtering: because the number of MILOs can exceed ten thousand, measuring MILO quality is an important issue, and filtering out useless MILOs can improve accuracy. Third, the design of a scoring model: to classify documents correctly, an effective model is needed to accurately assign a category to each unlabeled document. Finally, classification structure: traditional techniques use only one classifier, whereas this study presents a hierarchical classification structure to improve accuracy.
In summary, this study first presents a novel method that observes the distribution of terms across document paragraphs to extract MILOs for text categorization. Second, it presents an improved method that eliminates noise MILOs and uses a hierarchical classification structure. Experimental results for both methods show competitive performance on well-known benchmarks such as Reuters, WebKB and Ohsumed.
|
author2 |
沈肇基 |
author_facet |
沈肇基 Chia-Chuan Wu 吳家銓 |
author |
Chia-Chuan Wu 吳家銓 |
title |
A Study of Automatic Text Categorization based on Directional Term Structure |
publishDate |
2010 |
url |
http://ndltd.ncl.edu.tw/handle/25870222133144392967 |