A Keyword Extraction Algorithm for Single Chinese Document
Master's === National Chengchi University === Department of Computer Science === 102 === Over the past 14 years, the Taiwan e-Learning and Digital Archives Program has developed digital archives covering organisms, archaeology, geology, and other subjects. The digital archives contain 15 topics. The goal of the work presented in this thesis is to automatically extract keyw...
Main Authors: | Wu, Tai Hsun 吳泰勳 |
---|---|
Other Authors: | Hsu, Kuo Wei 徐國偉 |
Format: | Others |
Language: | zh-TW |
Online Access: | http://ndltd.ncl.edu.tw/handle/71876280732776370997 |
id |
ndltd-TW-102NCCU5394011 |
---|---|
record_format |
oai_dc |
spelling |
ndltd-TW-102NCCU53940112015-10-13T23:10:18Z http://ndltd.ncl.edu.tw/handle/71876280732776370997 A Keyword Extraction Algorithm for Single Chinese Document 一個對單篇中文文章擷取關鍵字之演算法 Wu, Tai Hsun 吳泰勳 碩士 國立政治大學 資訊科學學系 102 Hsu, Kuo Wei 徐國偉 學位論文 ; thesis 38 zh-TW |
collection |
NDLTD |
language |
zh-TW |
format |
Others |
sources |
NDLTD |
description |
Master's === National Chengchi University === Department of Computer Science === 102 === Over the past 14 years, the Taiwan e-Learning and Digital Archives Program has developed digital archives covering organisms, archaeology, geology, and other subjects. The digital archives contain 15 topics. The goal of this thesis is to automatically extract keywords from documents in these digital archives; the techniques developed along the way can also be used to connect the digital archives with news articles. Because news articles constantly introduce new words and new uses of existing words, we propose an algorithm that automatically extracts keywords from a single Chinese document without using a corpus or dictionary. Given a Chinese document, the algorithm first uses a bigram-based approach to divide the text into bigrams of Chinese characters. Next, it calculates the term frequency of each bigram and filters out bigrams with low term frequencies. Finally, it calculates chi-square values to produce the keywords most related to the topic of the document. The co-occurrence of words can serve as an indicator of their importance: if a term and some frequent terms have similar co-occurrence distributions, the term is probably a keyword. Unlike English, where words can be segmented at word delimiters, Chinese has no spaces between characters, so Chinese word segmentation remains a challenging task. The proposed algorithm performs Chinese word segmentation with a bigram-based approach, and we compare the resulting segmented words with those produced by CKIP and the Stanford Chinese Segmenter. We also present comparisons across two settings: one considers whether infrequent terms are filtered out, and the other considers whether frequent terms are clustered by a clustering algorithm.
The dataset used in the experiments was downloaded from the Academia Sinica Digital Resources, and the ground truth is provided by Gainwisdom, which was developed by the Computer Systems and Communication Lab at Academia Sinica. According to the experimental results, some of the segmented words produced by the bigram-based approach are the same as those produced by CKIP or the Stanford Chinese Segmenter, while some have stronger connections to the topics of the documents. The main advantage of the bigram-based approach is that it does not require a corpus or dictionary.
|
author2 |
Hsu, Kuo Wei |
author_facet |
Hsu, Kuo Wei Wu, Tai Hsun 吳泰勳 |
author |
Wu, Tai Hsun 吳泰勳 |
spellingShingle |
Wu, Tai Hsun 吳泰勳 A Keyword Extraction Algorithm for Single Chinese Document |
author_sort |
Wu, Tai Hsun |
title |
A Keyword Extraction Algorithm for Single Chinese Document |
title_sort |
keyword extraction algorithm for single chinese document |
url |
http://ndltd.ncl.edu.tw/handle/71876280732776370997 |
work_keys_str_mv |
AT wutaihsun akeywordextractionalgorithmforsinglechinesedocument AT wútàixūn akeywordextractionalgorithmforsinglechinesedocument AT wutaihsun yīgèduìdānpiānzhōngwénwénzhāngxiéqǔguānjiànzìzhīyǎnsuànfǎ AT wútàixūn yīgèduìdānpiānzhōngwénwénzhāngxiéqǔguānjiànzìzhīyǎnsuànfǎ AT wutaihsun keywordextractionalgorithmforsinglechinesedocument AT wútàixūn keywordextractionalgorithmforsinglechinesedocument |
_version_ |
1718084604494086144 |
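
The abstract in this record outlines three steps: split the document into character bigrams, drop bigrams with low term frequency, and score the remainder with chi-square values derived from co-occurrence. The Python sketch below is one possible reading of those steps, not the thesis's implementation. Several details are assumptions because the abstract does not specify them: the bigrams overlap, co-occurrence is counted per sentence, the sentence delimiters and the CJK character range are chosen for illustration, and the chi-square compares each bigram's co-occurrence counts against the counts expected from the overall distribution, with higher values ranked first. The names `sentence_bigrams`, `extract_keywords`, `min_tf`, `num_frequent`, and `top_k` are hypothetical. The clustering of frequent terms mentioned in the abstract is not covered.

```python
import re
from collections import Counter, defaultdict

# Assumed sentence delimiters and character range; neither is given in the abstract.
DELIMITERS = re.compile(r"[。！？；，、\s]+")
CJK_RUN = re.compile(r"[\u4e00-\u9fff]+")


def sentence_bigrams(text):
    """Return, for each sentence, its list of overlapping character bigrams."""
    sentences = []
    for sentence in DELIMITERS.split(text):
        bigrams = [run[i:i + 2]
                   for run in CJK_RUN.findall(sentence)
                   for i in range(len(run) - 1)]
        if bigrams:
            sentences.append(bigrams)
    return sentences


def extract_keywords(text, min_tf=2, num_frequent=10, top_k=5):
    """Rank bigrams by a chi-square score over sentence-level co-occurrence."""
    # Step 1: bigram-based segmentation.
    sentences = sentence_bigrams(text)

    # Step 2: term frequencies, then drop low-frequency bigrams.
    tf = Counter(b for s in sentences for b in s)
    kept = {b for b, count in tf.items() if count >= min_tf}
    frequent = [b for b, _ in tf.most_common() if b in kept][:num_frequent]
    frequent_set = set(frequent)

    # Count sentences in which a kept bigram w appears with a frequent bigram g.
    cooc = defaultdict(Counter)
    for s in sentences:
        present = set(s) & kept
        for w in present:
            for g in present:
                if w != g and g in frequent_set:
                    cooc[w][g] += 1

    column_totals = Counter()
    for row in cooc.values():
        column_totals.update(row)
    grand_total = sum(column_totals.values())

    # Step 3: chi-square of observed vs. expected co-occurrence counts.
    scores = {}
    for w, row in cooc.items():
        n_w = sum(row.values())
        chi2 = 0.0
        for g in frequent:
            if g == w:
                continue  # a bigram is never counted as co-occurring with itself
            expected = n_w * column_totals[g] / grand_total
            if expected > 0:
                chi2 += (row[g] - expected) ** 2 / expected
        scores[w] = chi2

    # Highest score first; ties broken by the bigram itself for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


if __name__ == "__main__":
    sample = "數位典藏計畫。數位典藏資料。新聞資料連結。"
    for bigram, score in extract_keywords(sample):
        print(bigram, round(score, 3))
```

Because the sketch needs only the input document, it mirrors the advantage the abstract claims for the bigram-based approach: no corpus or dictionary is required. The thresholds `min_tf` and `num_frequent` would need tuning on real documents.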