The investigation of fuzzy document classification on Internet

Bibliographic Details
Main Authors: Nai-Lung Tsao, 曹乃龍
Other Authors: Pei-Ching Lin
Format: Others
Language: zh-TW
Published: 2000
Online Access: http://ndltd.ncl.edu.tw/handle/96296609148688990400
Description
Summary: Master's thesis === Tamkang University === Department of Computer Science and Information Engineering === 88 === This thesis investigates automatic document classification on the Internet. Document classification has been studied for more than thirty years and has produced useful results, but that research has almost always addressed plain-text documents rather than documents on the Internet. Of the vast amount of data on the Internet, HTML documents make up the largest part. HTML documents differ significantly from plain-text documents, so they call for specialized text-analysis modules. The better the information extracted from an HTML document is applied and processed, the higher the recall and precision. The research focuses on the analysis and automatic classification of HTML documents. The book categories on the web site www.amazon.com provide the main data source, and a web spider is used to collect the data. The experimental category adopted from www.amazon.com is the Programming category under Computer and Internet, which contains four large classes and sixty small classes. We extract 1,215 items, of which 615 are used as training data and the rest as test data. The experimental approach comprises: keyword extraction; statistical weighting of the keywords; extraction of the information carried by HTML tags; combination of the keyword and HTML information into a feature vector; training of the classifier with a Semi-Supervised Fuzzy C-Means algorithm on the training data; and use of the vector space model to determine the class of a document.
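The pipeline outlined in the summary (tag-aware keyword weighting, a feature vector, and a vector space model for assigning a class) can be illustrated with a minimal Python sketch. It is not the thesis's implementation. The tag weights, the tokenizer, and all function names are hypothetical. Class prototypes here are plain averages of labeled training vectors, a simplification that stands in for the Semi-Supervised Fuzzy C-Means training step. The membership function uses the standard fuzzy c-means membership formula with cosine distance, which is also an assumption.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical tag weights; the summary does not state the values used in the thesis.
TAG_WEIGHTS = {"title": 3.0, "h1": 2.0, "b": 1.5}
DEFAULT_WEIGHT = 1.0


class TagWeightedTerms(HTMLParser):
    """Accumulate term weights, boosting text inside selected HTML tags."""

    def __init__(self):
        super().__init__()
        self.stack = []            # currently open tags
        self.weights = Counter()   # term -> accumulated weight

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag (tolerates sloppy HTML).
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        boost = max((TAG_WEIGHTS.get(t, DEFAULT_WEIGHT) for t in self.stack),
                    default=DEFAULT_WEIGHT)
        for term in re.findall(r"[a-z][a-z0-9+#]*", data.lower()):
            self.weights[term] += boost


def feature_vector(html):
    """Return a sparse term -> weight vector for one HTML document."""
    parser = TagWeightedTerms()
    parser.feed(html)
    parser.close()
    return dict(parser.weights)


def cosine(a, b):
    """Cosine similarity between two sparse vectors (vector space model)."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm = math.sqrt(sum(w * w for w in a.values())) * \
        math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Average a list of sparse vectors into one class prototype."""
    total = Counter()
    for vec in vectors:
        total.update(vec)
    return {t: w / len(vectors) for t, w in total.items()}


def classify(vec, centroids):
    """Pick the class whose prototype is most similar to the document."""
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


def fuzzy_memberships(vec, centroids, m=2.0):
    """Fuzzy c-means style memberships, using cosine distance (an assumption)."""
    dist = {label: 1.0 - cosine(vec, c) for label, c in centroids.items()}
    exact = [label for label, d in dist.items() if d <= 1e-12]
    if exact:  # the document coincides with one or more prototypes
        return {label: (1.0 / len(exact) if label in exact else 0.0)
                for label in dist}
    power = 2.0 / (m - 1.0)
    return {label: 1.0 / sum((dist[label] / dist[k]) ** power for k in dist)
            for label in dist}


# Toy training data, invented for illustration only.
training = {
    "java": ["<title>Java programming</title><p>java classes and objects</p>"],
    "perl": ["<title>Perl scripting</title><p>perl regular expressions</p>"],
}
centroids = {label: centroid([feature_vector(doc) for doc in docs])
             for label, docs in training.items()}
query = feature_vector("<h1>Learning Java</h1><p>objects in java</p>")
label = classify(query, centroids)
memberships = fuzzy_memberships(query, centroids)
```

In this sketch `classify` returns a single hard label, while `fuzzy_memberships` returns a degree of membership for every class, so a document may belong partly to several classes.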