Summary: | 碩士 === 國立臺灣師範大學 === 應用電子科技研究所 === 94 === Chinese word segmentation in a Chinese sentence is an essential step in the processing of Chinese natural language because it is beneficial to the Chinese text mining and information retrieval. Currently, the lexicon-based Chinese word segmentation scheme is the most widely used method, which can correctly identify Chinese sentences as distinct words from Chinese-language texts for real-word applications. However, the word identification ability of the lexicon-based scheme is highly dependent with a well prepared lexicon with sufficient amount of lexical entries which covers all of the Chinese words. In particular, this scheme cannot perform Chinese word segmentation process well for highly changeable texts with time, such as newspaper articles and web documents. This is because highly changeable documents often contain many new words that cannot be identified by the lexicon-based Chinese word segmentation systems with a constant lexicon. Moreover, to maintain the lexicon by manpower is an inefficient and time-consuming job. Based on the problems, this study proposes a novel statistics-based scheme for new word extraction based on the categorized corpuses of Google news retrieved from the Google news site automatically to promote the word identification ability for the lexicon-based Chinese word segmentation systems. Compared with another proposed method, the experimental results indicated that the proposed new word extraction scheme not only can more correctly retrieve news words from the categorized corpuses of Google news, but also obtain has larger amount of new words.
|