Using Google’s Keyword Relation in Multi-Domain Document Classification

Doctoral thesis === National Central University === Graduate Institute of Information Management === 100 === Classifying information automatically and efficiently has become increasingly important in recent years. We can collect all kinds of knowledge from search engines to improve the quality of decision making, and use document classification systems to man...

Bibliographic Details
Main Authors:	Ping-I Chen, 陳棅易
Other Authors:	Shi-Jen Lin
Format:	Others
Language:	en_US
Published:	2011
Online Access:	http://ndltd.ncl.edu.tw/handle/21958936930841133981

id	ndltd-TW-100NCU05396003
record_format	oai_dc
spelling	ndltd-TW-100NCU05396003 2015-10-13T21:22:20Z http://ndltd.ncl.edu.tw/handle/21958936930841133981 Using Google’s Keyword Relation in Multi-Domain Document Classification Google文字關聯在多領域文件分類上的應用 Ping-I Chen 陳棅易 Doctoral thesis, National Central University, Graduate Institute of Information Management, 100. Shi-Jen Lin 林熙禎 2011 學位論文 ; thesis 95 en_US
collection	NDLTD
language	en_US
format	Others
sources	NDLTD
description	Doctoral thesis === National Central University === Graduate Institute of Information Management === 100 === Classifying information automatically and efficiently has become increasingly important in recent years. We can collect all kinds of knowledge from search engines to improve the quality of decision making, and use document classification systems to manage the knowledge repository. Document classification systems must construct a keyword vector, usually containing thousands of words, to represent each knowledge domain, so the computational complexity of the classification algorithm is very high. Users must also download all the documents before extracting keywords and classifying them. In this thesis, we describe a new algorithm called “Word AdHoc Network” and use it to extract the most important keyword sequences from each document. Each keyword sequence consists of no more than four keywords. We also use a new similarity measure, called “Google Purity,” to calculate the similarity between the extracted keyword sequences and group similar documents together. With this system, we can classify information from different knowledge domains at the same time, and all processing runs in real time without any pre-established keyword repository. Our experiments show that the classification results are very accurate and useful. The only weakness of our system is that its execution time is longer than that of the cosine method. However, it saves the time otherwise spent selecting training data, and the vector for each domain can be kept to a 4-gram. This new system can improve the efficiency of document classification and make it more usable in Web-based information management.
author2	Shi-Jen Lin
author_facet	Shi-Jen Lin Ping-I Chen 陳棅易
author	Ping-I Chen 陳棅易
author_sort	Ping-I Chen
title	Using Google’s Keyword Relation in Multi-Domain Document Classification
publishDate	2011
url	http://ndltd.ncl.edu.tw/handle/21958936930841133981

Similar Items