Using Google’s Keyword Relation in Multi-Domain Document Classification

Ph.D. === National Central University === Graduate Institute of Information Management === 100 === Classifying information automatically and efficiently has become increasingly important in recent years. We can collect all kinds of knowledge from search engines to improve the quality of decision making, and use document classification systems to man...

Bibliographic Details
Main Authors: Ping-I Chen, 陳棅易
Other Authors: Shi-Jen Lin
Format: Others
Language: en_US
Published: 2011
Online Access: http://ndltd.ncl.edu.tw/handle/21958936930841133981
id ndltd-TW-100NCU05396003
record_format oai_dc
spelling ndltd-TW-100NCU05396003 2015-10-13T21:22:20Z http://ndltd.ncl.edu.tw/handle/21958936930841133981 Using Google’s Keyword Relation in Multi-Domain Document Classification Google文字關聯在多領域文件分類上的應用 Ping-I Chen 陳棅易 Ph.D. National Central University Graduate Institute of Information Management 100 Shi-Jen Lin 林熙禎 2011 thesis 95 en_US
collection NDLTD
language en_US
format Others
sources NDLTD
description Ph.D. === National Central University === Graduate Institute of Information Management === 100 === Classifying information automatically and efficiently has become increasingly important in recent years. We can collect all kinds of knowledge from search engines to improve the quality of decision making, and use document classification systems to manage the knowledge repository. Document classification systems typically need to construct a keyword vector, often containing thousands of words, to represent each knowledge domain. As a result, the computational complexity of the classification algorithm is very high. In addition, users need to download all the documents before extracting the keywords and classifying the documents. In this thesis, we describe a new algorithm called “Word AdHoc Network” and use it to extract the most important keyword sequences for each document. Each keyword sequence is composed of no more than four keywords. We also use a new similarity measurement algorithm, called “Google Purity,” to calculate the similarity between the extracted keyword sequences and thereby group similar documents together. With this system, we can classify information in different knowledge domains at the same time, and all processing runs in real time without any pre-established keyword repository. Our experiments show that the classification results are very accurate and useful. The only weakness of our system is that its execution time is longer than that of the cosine method. However, it saves the time otherwise spent choosing training data, and the vector for each domain can stay as short as a 4-gram. This new system can improve the efficiency of document classification and make it more usable in Web-based information management.
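For orientation, the “cosine method” that the abstract uses as its baseline can be sketched as follows. This is a minimal illustration of generic cosine-similarity classification over keyword-count vectors, not the thesis's implementation; the function names, the whitespace tokenizer, and the sample domains are all assumptions made for the example. The record does not define “Word AdHoc Network” or “Google Purity,” so neither is reproduced here.

```python
import math
from collections import Counter


def keyword_vector(text):
    """Build a sparse term-count vector from whitespace-separated tokens.

    Hypothetical stand-in for real keyword extraction (assumption).
    """
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Return the cosine of the angle between two sparse count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


def classify(document, domain_vectors):
    """Assign the document to the domain whose vector is most similar."""
    doc_vec = keyword_vector(document)
    return max(
        domain_vectors,
        key=lambda name: cosine_similarity(doc_vec, domain_vectors[name]),
    )


# Illustrative domains only; a real baseline would use vectors with
# thousands of keywords per domain, as the abstract notes.
domains = {
    "sports": keyword_vector("football goal match team"),
    "finance": keyword_vector("stock market bank interest"),
}
label = classify("the team scored a goal in the match", domains)
```

Each comparison costs time proportional to the number of distinct keywords in the document vector, which is why vectors with thousands of entries make this baseline expensive to build and maintain even though each comparison is fast.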
author2 Shi-Jen Lin
author_facet Shi-Jen Lin
Ping-I Chen
陳棅易
author Ping-I Chen
陳棅易
title Using Google’s Keyword Relation in Multi-Domain Document Classification
publishDate 2011
url http://ndltd.ncl.edu.tw/handle/21958936930841133981