The investigation of fuzzy document classification on Internet

Master's === Tamkang University === Department of Computer Science and Information Engineering === 88 === This thesis investigates automatic document classification on the Internet. Document classification has been studied for more than thirty years and has produced useful results. That research, however, has always worked on...

Bibliographic Details
Main Authors:	Nai-Lung Tsao, 曹乃龍
Other Authors:	Pei-Ching Lin
Format:	Others
Language:	zh-TW
Published:	2000
Online Access:	http://ndltd.ncl.edu.tw/handle/96296609148688990400

id	ndltd-TW-088TKU00392008
record_format	oai_dc
spelling	ndltd-TW-088TKU00392008 2016-01-29T04:19:18Z http://ndltd.ncl.edu.tw/handle/96296609148688990400 The investigation of fuzzy document classification on Internet 模糊自動文件分類在網際網路上的探討 Nai-Lung Tsao 曹乃龍 Master's, Tamkang University, Department of Computer Science and Information Engineering, 88 Pei-Ching Lin 林丕靜 2000 thesis 55 zh-TW
collection	NDLTD
language	zh-TW
format	Others
sources	NDLTD
description	Master's === Tamkang University === Department of Computer Science and Information Engineering === 88 === This thesis investigates automatic document classification on the Internet. Document classification has been studied for more than thirty years and has produced useful results. That research, however, has always worked on plain-text documents rather than documents on the Internet. The Internet holds a vast amount of data, most of it in the form of HTML documents. HTML documents differ significantly from plain-text documents, and these differences call for specialized text analysis modules. The better the information extracted from an HTML document is processed and applied, the higher the recall and precision. The subject of this research is the analysis and automatic classification of HTML documents. The book categories on the web site www.amazon.com provide the main data source, and a web spider is used to collect the data. The experimental category adopted from www.amazon.com is the Programming category of Computer and Internet. It contains four large classes and sixty small classes. We extracted 1215 data items; 615 are used as training data and the remaining 600 as testing data. The experimental approach comprises: keyword extraction; a statistical method for calculating keyword weights; extraction of the information carried by HTML tags; combination of the keyword and HTML information into a feature vector; training the classifier with the Semi-Supervised Fuzzy C-Means algorithm and the training data; and using the vector space model to determine the class of a document.
author2	Pei-Ching Lin
author_facet	Pei-Ching Lin Nai-Lung Tsao 曹乃龍
author	Nai-Lung Tsao 曹乃龍
spellingShingle	Nai-Lung Tsao 曹乃龍 The investigation of fuzzy document classification on Internet
author_sort	Nai-Lung Tsao
title	The investigation of fuzzy document classification on Internet
title_sort	investigation of fuzzy document classification on internet
publishDate	2000
url	http://ndltd.ncl.edu.tw/handle/96296609148688990400
_version_	1718169005042171904
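
The approach listed in the description (HTML-aware keyword weighting, feature vectors, fuzzy class membership, vector space comparison) might look roughly like the sketch below. It is an illustration under stated assumptions, not the thesis's implementation: the tag weights in TAG_WEIGHTS, the function names, the use of 1 - cosine as the distance, and the fuzzifier m = 2 are all hypothetical, the keyword weight is a plain tag-weighted term count because the record does not say which statistical method was used, and only the standard fuzzy c-means membership formula is shown, not the semi-supervised training loop.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical tag weights; the record does not state the values the thesis used.
TAG_WEIGHTS = {"title": 3.0, "h1": 2.0, "b": 1.5}


class WeightedTermParser(HTMLParser):
    """Count terms, scaling each by the largest weight among its enclosing tags."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop back to the matching open tag; tolerates unclosed inner tags.
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        weight = max((TAG_WEIGHTS.get(t, 1.0) for t in self.stack), default=1.0)
        for term in re.findall(r"[a-z0-9]+", data.lower()):
            self.counts[term] += weight


def term_vector(html):
    """Turn an HTML string into a sparse {term: weight} feature vector."""
    parser = WeightedTermParser()
    parser.feed(html)
    parser.close()
    return dict(parser.counts)


def centroid(vectors):
    """Average a list of sparse vectors into one class prototype."""
    total = Counter()
    for vec in vectors:
        total.update(vec)  # adds the weights term by term
    return {term: w / len(vectors) for term, w in total.items()}


def cosine(u, v):
    """Cosine similarity of two sparse vectors (vector space model)."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def fuzzy_memberships(doc, centroids, m=2.0):
    """Fuzzy c-means membership of doc in each class.

    Assumption: distance is 1 - cosine, not the Euclidean norm of textbook FCM.
    """
    dists = {c: 1.0 - cosine(doc, vec) for c, vec in centroids.items()}
    exact = [c for c, d in dists.items() if d <= 1e-12]
    if exact:
        # The document coincides with one or more prototypes.
        return {c: (1.0 / len(exact) if c in exact else 0.0) for c in dists}
    power = 2.0 / (m - 1.0)
    inverse = {c: d ** -power for c, d in dists.items()}
    total = sum(inverse.values())
    return {c: value / total for c, value in inverse.items()}
```

A document would then be assigned to the class with the largest membership, for example max(u, key=u.get) where u is the dictionary returned by fuzzy_memberships for that document's term_vector and centroids built from the labeled training pages.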

Similar Items