The investigation of fuzzy document classification on Internet

Master's === Tamkang University === Department of Computer Science and Information Engineering === 88 === This thesis investigates automatic document classification on the Internet. Document classification has been studied for more than thirty years and has produced useful results. Such research, however, has always worked on...


Bibliographic Details
Main Authors: Nai-Lung Tsao, 曹乃龍
Other Authors: Pei-Ching Lin
Format: Others
Language: zh-TW
Published: 2000
Online Access: http://ndltd.ncl.edu.tw/handle/96296609148688990400
id ndltd-TW-088TKU00392008
record_format oai_dc
spelling ndltd-TW-088TKU00392008 2016-01-29T04:19:18Z http://ndltd.ncl.edu.tw/handle/96296609148688990400 The investigation of fuzzy document classification on Internet 模糊自動文件分類在網際網路上的探討 Nai-Lung Tsao 曹乃龍 Master's, Tamkang University, Department of Computer Science and Information Engineering, 88 Pei-Ching Lin 林丕靜 2000 學位論文 ; thesis 55 zh-TW
collection NDLTD
language zh-TW
format Others
sources NDLTD
description Master's === Tamkang University === Department of Computer Science and Information Engineering === 88 === This thesis investigates automatic document classification on the Internet. Document classification has been studied for more than thirty years and has produced useful results. Such research, however, has always worked on plain-text documents, not on documents on the Internet. Of the large volume of data on the Internet, HTML documents make up the main part. Significant differences between plain-text and HTML documents call for specialized text analysis modules. The better the information extracted from an HTML document is applied and processed, the higher the recall and precision. The research focuses on the analysis and automatic classification of HTML documents. The book categories on the web site www.amazon.com provide the main data source, and a web spider is used to collect the data. The experimental category adopted from www.amazon.com is Programming of Computer and Internet, which contains four large classes and sixty small classes. We extracted 1215 items; 615 are used as training data and the rest as testing data. The experimental approach comprises: keyword extraction; a statistical method for calculating keyword weights; extraction of the information carried by HTML tags; combining the keyword and HTML information to build the feature vector; training the classifier on the training data with the Semi-Supervised Fuzzy C-Means algorithm; and using the vector space model to determine the class of each document.
author2 Pei-Ching Lin
author_facet Pei-Ching Lin
Nai-Lung Tsao
曹乃龍
author Nai-Lung Tsao
曹乃龍
title The investigation of fuzzy document classification on Internet
publishDate 2000
url http://ndltd.ncl.edu.tw/handle/96296609148688990400