Summary: | Master's === National Yunlin University of Science and Technology === Department of Information Management === 104 === In information retrieval, stop words are words that are filtered out before or after processing in order to improve efficiency and save storage space. An automatic keyword extraction system based on the CKIP system has seven main processes: symbol conversion, CKIP segmentation, part-of-speech merging, stop word removal, synonym filtering, word weight calculation, and keyword extraction. At present, stop words are removed using a manually constructed Chinese stop word lexicon, which takes a lot of time to extend and maintain.
The purpose of this study is to develop an automatically updated Chinese stop word dictionary, addressing the disadvantage that existing Chinese stop word dictionaries must be established and maintained manually. This study identifies Chinese stop words using three methods, "normalized inverse document frequency", "entropy", and "Borda count", and four judging criteria: "thesaurus", "part of speech", "intersection of thesaurus and part of speech", and "union of thesaurus and part of speech". The evaluation indicators are "precision", "recall", and "F-measure". Experimental results show that the "normalized inverse document frequency" method performs best under every criterion, while the "thesaurus" criterion gives the best performance regardless of method. In addition, as the amount of data increases, the evaluation values show a downward trend. However, after Chinese stop words are added automatically, F1 increases by an average of 7 to 8% over the original value.
|
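The three scoring methods and the evaluation indicators named in the summary can be illustrated with a minimal Python sketch. The summary does not give the exact formulas, so the code assumes common textbook definitions: IDF scaled by log N, entropy of a term's frequency distribution across documents scaled by log N, and a Borda count that awards m - position points per ranking of length m. All function names are hypothetical, and documents are assumed to be already segmented into token lists (for example by CKIP).

```python
import math
from collections import Counter


def normalized_idf(docs):
    """Map each term to IDF scaled to [0, 1]; values near 0 suggest stop words.

    Assumed definition: log(N / df) / log(N).
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    scale = math.log(n) if n > 1 else 1.0
    return {term: math.log(n / count) / scale for term, count in df.items()}


def term_entropy(docs):
    """Map each term to the entropy of its spread over documents, scaled to [0, 1].

    Values near 1 mean the term is spread evenly, which suggests a stop word.
    """
    n = len(docs)
    per_doc = [Counter(doc) for doc in docs]
    totals = Counter()
    for counts in per_doc:
        totals.update(counts)
    scale = math.log(n) if n > 1 else 1.0
    result = {}
    for term, total in totals.items():
        h = 0.0
        for counts in per_doc:
            if counts[term]:
                p = counts[term] / total
                h -= p * math.log(p)
        result[term] = h / scale
    return result


def borda_count(rankings):
    """Merge rankings (most stop-word-like first) into one list by Borda points."""
    points = Counter()
    for ranking in rankings:
        m = len(ranking)
        for position, term in enumerate(ranking):
            points[term] += m - position
    # Ties are broken alphabetically so the result is deterministic.
    return sorted(points, key=lambda term: (-points[term], term))


def precision_recall_f1(predicted, reference):
    """Score predicted stop words against a reference stop word list."""
    predicted, reference = set(predicted), set(reference)
    hits = len(predicted & reference)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(reference) if reference else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

One possible usage under these assumptions: rank terms by ascending normalized IDF and by descending entropy, pass both rankings to `borda_count`, take the top k terms as stop word candidates, and score them with `precision_recall_f1` against whichever judging criterion serves as the reference set.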