Using Google’s Keyword Relation in Multi-Domain Document Classification
Doctoral thesis === National Central University (國立中央大學) === Graduate Institute of Information Management (資訊管理研究所) === 100 === Classifying information automatically and efficiently has become increasingly important in recent years. We can collect all kinds of knowledge from search engines to improve the quality of decision making, and use document classification systems to man...
Main Authors: | Ping-I Chen, 陳棅易 |
---|---|
Other Authors: | Shi-Jen Lin |
Format: | Others |
Language: | en_US |
Published: | 2011 |
Online Access: | http://ndltd.ncl.edu.tw/handle/21958936930841133981 |
id | ndltd-TW-100NCU05396003 |
---|---|
record_format | oai_dc |
spelling | ndltd-TW-100NCU05396003 2015-10-13T21:22:20Z http://ndltd.ncl.edu.tw/handle/21958936930841133981 Using Google’s Keyword Relation in Multi-Domain Document Classification Google文字關聯在多領域文件分類上的應用 Ping-I Chen 陳棅易 Doctoral thesis, National Central University (國立中央大學), Graduate Institute of Information Management (資訊管理研究所), 100 Shi-Jen Lin 林熙禎 2011 學位論文 ; thesis 95 en_US |
collection | NDLTD |
language | en_US |
format | Others |
sources | NDLTD |
description | Doctoral thesis === National Central University (國立中央大學) === Graduate Institute of Information Management (資訊管理研究所) === 100 === Classifying information automatically and efficiently has become increasingly important in recent years. We can collect all kinds of knowledge from search engines to improve the quality of decision making, and use document classification systems to manage the knowledge repository. Such systems typically construct a keyword vector, often containing thousands of words, to represent each knowledge domain. As a result, the computational complexity of the classification algorithm is very high. Users must also download all the documents before keywords can be extracted and the documents classified. In this thesis, we describe a new algorithm called “Word AdHoc Network” and use it to extract the most important sequences of keywords for each document. Each keyword sequence consists of no more than four keywords. We also use a new similarity measure, called “Google Purity,” to calculate the similarity between the extracted keyword sequences so that similar documents are classified together. With this system, we can classify information from different knowledge domains at the same time, and everything runs in real time without any pre-established keyword repository. Our experiments show that the classification results are accurate and useful. The only weakness of our system is that its execution time is longer than that of the cosine method. However, it saves the time otherwise spent choosing training data, and the vector for each domain can be kept to a 4-gram. This new system can improve the efficiency of document classification and make it more usable in Web-based information management. |
author2 | Shi-Jen Lin |
author_facet | Shi-Jen Lin Ping-I Chen 陳棅易 |
author | Ping-I Chen 陳棅易 |
spellingShingle | Ping-I Chen 陳棅易 Using Google’s Keyword Relation in Multi-Domain Document Classification |
author_sort | Ping-I Chen |
title | Using Google’s Keyword Relation in Multi-Domain Document Classification |
title_short | Using Google’s Keyword Relation in Multi-Domain Document Classification |
title_full | Using Google’s Keyword Relation in Multi-Domain Document Classification |
title_fullStr | Using Google’s Keyword Relation in Multi-Domain Document Classification |
title_full_unstemmed | Using Google’s Keyword Relation in Multi-Domain Document Classification |
title_sort | using google’s keyword relation in multi-domain document classification |
publishDate | 2011 |
url | http://ndltd.ncl.edu.tw/handle/21958936930841133981 |
work_keys_str_mv | AT pingichen usinggoogleskeywordrelationinmultidomaindocumentclassification AT chénbǐngyì usinggoogleskeywordrelationinmultidomaindocumentclassification AT pingichen googlewénzìguānliánzàiduōlǐngyùwénjiànfēnlèishàngdeyīngyòng AT chénbǐngyì googlewénzìguānliánzàiduōlǐngyùwénjiànfēnlèishàngdeyīngyòng |
_version_ | 1718061177792102400 |