A Study of Automatic Text Categorization based on Directional Term Structure

Master's === National Chung Hsing University === Department of Information Management === 98 === In mainstream research on automatic text categorization, rule-based classifiers offer an interesting advantage: interpretability. Because the rules in such a classifier are composed of terms, they can be easily understood, modified and maintained by h...


Bibliographic Details
Main Authors: Chia-Chuan Wu, 吳家銓
Other Authors: 沈肇基
Format: Others
Language: en_US
Published: 2010
Online Access: http://ndltd.ncl.edu.tw/handle/25870222133144392967
id ndltd-TW-098NCHU5396002
record_format oai_dc
spelling ndltd-TW-098NCHU5396002 2015-10-30T04:05:02Z http://ndltd.ncl.edu.tw/handle/25870222133144392967 A Study of Automatic Text Categorization based on Directional Term Structure 運用有向詞彙結構於文件自動分類之研究 Chia-Chuan Wu 吳家銓 Master's National Chung Hsing University Department of Information Management 98 沈肇基 2010 degree thesis ; thesis 87 en_US
collection NDLTD
language en_US
format Others
sources NDLTD
description Master's === National Chung Hsing University === Department of Information Management === 98 === In mainstream research on automatic text categorization, rule-based classifiers offer an interesting advantage: interpretability. Because the rules in such a classifier are composed of terms, they can be easily understood, modified and maintained by humans. Their accuracy is also competitive with that of other accurate classifiers such as SVM and Bayes nets, so rule-based classification techniques are very popular. Current rule-based techniques identify valuable patterns in training documents to construct classification rules, but they do not consider the relationship between terms and paragraphs within a document. Because there must be a dominant topic throughout a document's content, certain term structures are generated across its paragraphs. This study therefore presents a new concept, the Meaningful Inner Link Object (MILO), which captures underlying directional term links across the paragraphs of a document for text categorization. The MILO process for text categorization consists of four main procedures. First, feature selection: the purpose is to find representative terms from which to compose MILOs, because training documents contain a large number of noise terms. Second, MILO filtering: because the number of MILOs can exceed ten thousand, measuring MILO quality is an important issue, and filtering out useless MILOs can improve accuracy. Third, the design of a scoring model: an effective model is needed to accurately assign a category to an unlabeled document. Finally, the classification structure: traditional techniques use only one classifier, whereas this study presents a hierarchical classification structure to improve accuracy.
In summary, a novel method is first presented that extracts MILOs for text categorization by observing the distribution of terms across document paragraphs. An improved method is then presented that eliminates noisy MILOs and uses a hierarchical classification structure. The experimental results for the two methods show competitive performance on well-known benchmarks such as Reuters, WebKB and Ohsumed.
author2 沈肇基
author_facet 沈肇基
Chia-Chuan Wu
吳家銓
author Chia-Chuan Wu
吳家銓
author_sort Chia-Chuan Wu
title A Study of Automatic Text Categorization based on Directional Term Structure
publishDate 2010
url http://ndltd.ncl.edu.tw/handle/25870222133144392967