A Keyword Extraction Algorithm for Single Chinese Document

Bibliographic Details
Main Authors: Wu, Tai Hsun, 吳泰勳
Other Authors: Hsu, Kuo Wei
Format: Others
Language: zh-TW
Online Access: http://ndltd.ncl.edu.tw/handle/71876280732776370997
id ndltd-TW-102NCCU5394011
record_format oai_dc
spelling ndltd-TW-102NCCU53940112015-10-13T23:10:18Z http://ndltd.ncl.edu.tw/handle/71876280732776370997 A Keyword Extraction Algorithm for Single Chinese Document 一個對單篇中文文章擷取關鍵字之演算法 Wu, Tai Hsun 吳泰勳 碩士 國立政治大學 資訊科學學系 102 Hsu, Kuo Wei 徐國偉 學位論文 ; thesis 38 zh-TW
collection NDLTD
language zh-TW
format Others
sources NDLTD
description Master's thesis (碩士) === National Chengchi University (國立政治大學) === Department of Computer Science (資訊科學學系) === 102 === Over the past 14 years, the Taiwan e-Learning and Digital Archives Program has developed digital archives of organisms, archaeology, geology, and other subjects. There are 15 topics in the digital archives. The goal of the work presented in this thesis is to automatically extract keywords from documents in the digital archives, and the techniques developed along the way can be used to build a connection between the digital archives and news articles. Because news articles constantly contain new words or new uses of words, in this thesis we propose an algorithm that automatically extracts keywords from a single Chinese document without using a corpus or dictionary. Given a document in Chinese, the algorithm first uses a bigram-based approach to divide it into bigrams of Chinese characters. Next, the algorithm calculates the term frequencies of the bigrams and filters out those with low term frequencies. Finally, the algorithm calculates chi-square values to produce the keywords that are most related to the topic of the given document. The co-occurrence of words can be used as an indicator of the importance of words: if a term and some frequent terms have similar distributions of co-occurrence, the term is probably a keyword. Unlike English word segmentation, which can be done by using word delimiters, Chinese word segmentation is a challenging task because there are no spaces between characters in Chinese. The proposed algorithm performs Chinese word segmentation by using a bigram-based approach, and we compare the segmented words with those given by CKIP and the Stanford Chinese Segmenter. We present comparisons for two settings: one considers whether or not infrequent terms are filtered out, and the other considers whether or not frequent terms are clustered by a clustering algorithm.
The dataset used in the experiments was downloaded from the Academia Sinica Digital Resources, and the ground truth is provided by Gainwisdom, which is developed by the Computer Systems and Communication Lab at Academia Sinica. According to the experimental results, some of the segmented words given by the bigram-based approach adopted in the proposed algorithm are the same as those given by CKIP or the Stanford Chinese Segmenter, while other segmented words given by the bigram-based approach have stronger connections to the topics of the documents. The main advantage of the bigram-based approach is that it does not require a corpus or dictionary.
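The three steps named in the abstract (bigram splitting, term-frequency filtering, chi-square scoring against frequent terms) can be illustrated with a minimal Python sketch. This is not the thesis implementation. The abstract does not give the exact statistic, thresholds, or sentence delimiters, so everything concrete here is an assumption: the function names `char_bigrams`, `chi_square_scores`, and `extract_keywords`, the parameters `min_tf`, `top_frequent`, and `top_k`, the punctuation-based sentence splitting, and the use of a generic co-occurrence chi-square (observed versus expected sentence-level co-occurrence with the most frequent bigrams). The clustering of frequent terms mentioned in the abstract is left out.

```python
import re
from collections import Counter

# Assumed sentence delimiters; the abstract does not specify them.
SENTENCE_BREAKS = re.compile(r"[。！？；，、\s]+")


def char_bigrams(sentence):
    """Return the overlapping two-character substrings of a sentence."""
    return [sentence[i:i + 2] for i in range(len(sentence) - 1)]


def chi_square_scores(per_sentence, frequent):
    """Score each term by how far its co-occurrence with the frequent
    terms deviates from what its overall usage would predict."""
    total = sum(len(grams) for grams in per_sentence)
    frequent_set = set(frequent)
    n = Counter()     # n[w]: number of terms in sentences containing w
    cooc = Counter()  # cooc[(w, g)]: sentences containing both w and g
    for grams in per_sentence:
        present = set(grams)
        for w in present:
            n[w] += len(grams)
            for g in present & frequent_set:
                if g != w:
                    cooc[(w, g)] += 1
    scores = {}
    for w in n:
        score = 0.0
        for g in frequent:
            if g == w:
                continue
            expected = n[w] * n[g] / total
            if expected > 0:
                score += (cooc[(w, g)] - expected) ** 2 / expected
        scores[w] = score
    return scores


def extract_keywords(text, min_tf=2, top_frequent=10, top_k=5):
    """Sketch of the pipeline: bigrams, TF filter, chi-square ranking."""
    sentences = [s for s in SENTENCE_BREAKS.split(text) if len(s) >= 2]
    per_sentence = [char_bigrams(s) for s in sentences]

    # Term frequencies of all bigrams; drop the low-frequency ones.
    tf = Counter(b for grams in per_sentence for b in grams)
    kept = {b for b, count in tf.items() if count >= min_tf}
    if not kept:
        return []
    per_sentence = [[b for b in grams if b in kept] for grams in per_sentence]

    # The most frequent surviving bigrams act as the reference terms.
    frequent = sorted(kept, key=lambda b: (-tf[b], b))[:top_frequent]

    scores = chi_square_scores(per_sentence, frequent)
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_k]
```

A call such as `extract_keywords(document_text, min_tf=3)` returns up to `top_k` bigrams ranked by score. Because the bigrams overlap, adjacent bigrams from the same word trivially co-occur; a fuller treatment would need to merge or deduplicate them, which this sketch does not attempt.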
author2 Hsu, Kuo Wei
author_facet Hsu, Kuo Wei
Wu, Tai Hsun
吳泰勳
author Wu, Tai Hsun
吳泰勳
spellingShingle Wu, Tai Hsun
吳泰勳
A Keyword Extraction Algorithm for Single Chinese Document
author_sort Wu, Tai Hsun
title A Keyword Extraction Algorithm for Single Chinese Document
title_short A Keyword Extraction Algorithm for Single Chinese Document
title_full A Keyword Extraction Algorithm for Single Chinese Document
title_fullStr A Keyword Extraction Algorithm for Single Chinese Document
title_full_unstemmed A Keyword Extraction Algorithm for Single Chinese Document
title_sort keyword extraction algorithm for single chinese document
url http://ndltd.ncl.edu.tw/handle/71876280732776370997
work_keys_str_mv AT wutaihsun akeywordextractionalgorithmforsinglechinesedocument
AT wútàixūn akeywordextractionalgorithmforsinglechinesedocument
AT wutaihsun yīgèduìdānpiānzhōngwénwénzhāngxiéqǔguānjiànzìzhīyǎnsuànfǎ
AT wútàixūn yīgèduìdānpiānzhōngwénwénzhāngxiéqǔguānjiànzìzhīyǎnsuànfǎ
AT wutaihsun keywordextractionalgorithmforsinglechinesedocument
AT wútàixūn keywordextractionalgorithmforsinglechinesedocument
_version_ 1718084604494086144