Summary: | Master's === National Yunlin University of Science and Technology === Master's Program, Department of Information Management === 93 === Abbreviations are common in Chinese text. For instance, ‘台灣鐵路局’ (Taiwan Railways Administration) is often shortened to ‘台鐵局’. Such shortening saves time and is convenient, but it also poses challenges for Chinese text processing. In a keyword-based information retrieval system, searching with the abbreviated form and with the original form usually returns different results, even though both have the same meaning. Abbreviations also clearly affect Chinese word segmentation, automatic document clustering, and term weighting.
To address this semantic ambiguity problem, we propose an approach that links the two forms and automatically constructs an abbreviation list from a corpus, without relying on any fixed dictionary.
In this study, we conduct three major experiments on 8,500 documents from a news website. Each experiment is a two-way process, mapping original forms to abbreviations and back. In the first experiment, we employ a Maximum Entropy Model, which uses many contextual “features” to locate the best candidate. In the second experiment, we attempt to recover original forms from their abbreviations. The third experiment aims to find abbreviations from their original forms. The precision achieved is 80%-90%, 70%, and 80%, respectively.
|