Summary: | Master's === National Chi Nan University === Department of Computer Science and Information Engineering === 93 === Recognizing biological named entities is a basic and important problem for information extraction systems that automatically extract information about entities such as proteins and genes from the biomedical literature. In this thesis, to extract protein-related information from the literature, we propose a novel method for recognizing protein names based on heuristics and association mining concepts. Partial name fragments (tokens) of proteins can be detected efficiently by heuristic rules that capture morphological features of protein names. However, the exact boundary of a protein name is hard to determine from these rules alone. By regarding protein name tokens as items, we apply association mining to discover significant sequential patterns (SSPs) from protein name dictionaries, in which each dictionary record is regarded as a transaction and the dictionary corresponds to the transaction database. Consequently, an SSP consists of protein name tokens that tend to appear together in protein names in a specific order. Using SSPs, our protein name recognition system (PNRS) is able to extract the exact protein name by extending the protein name tokens detected by the heuristic rules. On the Yapex101 corpus, the experimental results show that the F-measure of PNRS is 68.5%, which is slightly better than that of the systems developed by Franzen and Seki.
|
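The pipeline described in the summary (heuristic seed tokens, patterns mined from a name dictionary, boundary extension) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the thesis: the function names `is_seed`, `mine_patterns`, and `extend_seed`, the toy dictionary, the suffix and letter-digit heuristics, the support threshold, and the simplification that a pattern is a contiguous ordered run of tokens. The thesis's actual rules and SSP definition may differ.

```python
import re
from collections import Counter


def is_seed(token):
    """Hypothetical morphological heuristic: '-ase' suffix or a letter-digit mix."""
    return token.lower().endswith("ase") or bool(
        re.search(r"[A-Za-z]\d|\d[A-Za-z]", token)
    )


def mine_patterns(dictionary, min_support=2, max_len=3):
    """Treat each dictionary record as a transaction and keep ordered token
    runs (length 2..max_len) that occur in at least min_support records."""
    counts = Counter()
    for name in dictionary:
        tokens = name.lower().split()
        seen = set()
        for n in range(2, max_len + 1):
            for i in range(len(tokens) - n + 1):
                seen.add(tuple(tokens[i:i + n]))
        counts.update(seen)  # a record supports each pattern at most once
    return {p for p, c in counts.items() if c >= min_support}


def extend_seed(tokens, seed, patterns, max_len=3):
    """Return the [start, end) span of the longest mined pattern that covers
    the seed token; fall back to the seed token alone if none matches."""
    lowered = [t.lower() for t in tokens]
    for n in range(max_len, 1, -1):
        first = max(0, seed - n + 1)
        last = min(seed, len(tokens) - n)
        for start in range(first, last + 1):
            if tuple(lowered[start:start + n]) in patterns:
                return start, start + n
    return seed, seed + 1


# Toy dictionary and sentence, invented for this example.
dictionary = [
    "protein kinase C",
    "protein kinase A",
    "tyrosine kinase receptor",
    "MAP kinase kinase",
]
patterns = mine_patterns(dictionary)
sentence = "activation of protein kinase was observed".split()
seeds = [i for i, tok in enumerate(sentence) if is_seed(tok)]
spans = [extend_seed(sentence, i, patterns) for i in seeds]
names = [" ".join(sentence[a:b]) for a, b in spans]
```

In this toy run the heuristic flags only the token "kinase", and the mined pattern ("protein", "kinase") widens the span to "protein kinase", which mirrors the idea of using dictionary patterns to settle the boundary that the heuristics alone cannot.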