Collective Entity Identification and Ranking in Biomedical Full Text Literature


Bibliographic Details
Main Authors: Dai, Hong-Jie, 戴鴻傑
Other Authors: Hsu, Wen-Lian
Format: Others
Language: en_US
Published: 2012
Online Access: http://ndltd.ncl.edu.tw/handle/14199680776469560455
Description
Summary: Doctoral dissertation === National Tsing Hua University === Department of Computer Science === 100 === Several studies have shown that finding information about specific entities is the most common information need of information retrieval users. Such needs should be answered by returning specific entities, their properties, or related entities, rather than just documents. While some search engines can recognize specific types of entities, true entity-oriented search still has a long way to go because names are highly ambiguous across documents. Entity Linking (EL) goes beyond entity recognition by linking a textual entity mention to a knowledge base entry. It is a difficult task involving several challenges, including name variation and ambiguity. This dissertation uses the identification of one particular entity type in biomedical articles, gene and gene product mentions, as a case study to explore the EL task. Unlike most previous EL-related work, it approaches the task at the instance level and evaluates performance from an integrated view of the recognition and linking steps. Considering EL at the instance level makes the approach and its evaluation results more relevant to developers of information extraction applications. The dissertation compiles the first instance-based gene mention linking corpus, which uncovers new challenges that current EL approaches need to address. A collective EL approach is proposed to deal with those challenges by using not only the contextual information of each individual instance but also the relations among instances. The experimental results show that the collective EL approach achieves an F-score of 74.1%, outperforming the traditional individual classification approach by 1.7%. The collective approach can also be extended to exploit the characteristics of different paper sections, which further improves the F-score by 1.82% on full text.
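To make the integrated, instance-level evaluation concrete, the following is a minimal sketch rather than the dissertation's actual scorer. It assumes each mention instance is represented as a hypothetical `(doc_id, start, end, kb_id)` tuple, and it counts a prediction as correct only when both the recognized span and the linked knowledge base identifier match the gold annotation.

```python
from typing import Iterable, Tuple

# Hypothetical instance layout: (document id, span start, span end, knowledge base id).
Instance = Tuple[str, int, int, str]


def instance_level_prf(gold: Iterable[Instance],
                       predicted: Iterable[Instance]) -> Tuple[float, float, float]:
    """Return (precision, recall, F-score) over exact span-and-identifier matches."""
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Illustrative data only: three gold instances, two predictions.
gold = [("d1", 0, 4, "G1"), ("d1", 10, 14, "G2"), ("d1", 20, 24, "G1")]
pred = [("d1", 0, 4, "G1"), ("d1", 10, 14, "G3")]  # second span is right, link is wrong
precision, recall, f_score = instance_level_prf(gold, pred)
```

Under this assumed scheme, a correctly recognized mention that is linked to the wrong entry counts as both a false positive and a false negative, which is what makes the score reflect recognition and linking together.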
In addition, retrieving entities as answers to a query has emerged as a new research field. Here the goal is not just to recognize the names of entities in documents but to return a ranked list of the relevant entities. The ranking task has wide applications. For example, in the curation process of bibliographic databases, generating a ranked entity list is very important because only a small percentage of the entities mentioned in an article are suitable for indexing. Such a ranked list can help a curator select suitable entities for curation. In this dissertation, a global ranking framework, which considers the existing relationships between the entities to be ranked, is proposed to improve the performance of conventional entity ranking models. With the proposed framework, the performance of the local ranking model improves by 3.2% under the official evaluation metric of the BioCreAtIvE challenge. In addition, using the standard ranking quality measure NDCG, the dissertation demonstrates that the proposed framework can be cascaded with different local ranking models and still improve their ranking results.
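For readers unfamiliar with NDCG, the measure can be sketched as below. This is a minimal illustration that assumes a linear gain and a log2 rank discount; the abstract does not say which NDCG variant the dissertation uses, and the function names are hypothetical.

```python
import math
from typing import Sequence


def dcg(relevances: Sequence[float]) -> float:
    """Discounted cumulative gain of relevance grades listed in ranked order."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))


def ndcg(relevances: Sequence[float]) -> float:
    """DCG divided by the DCG of the ideal (descending) ordering of the same grades."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


# Illustrative grades only: 1 = entity suitable for curation, 0 = not suitable.
perfect_ranking = ndcg([1, 1, 0])
delayed_ranking = ndcg([0, 1])
```

A ranking that places every relevant entity first scores 1.0, and the score drops as relevant entities are pushed further down the list, which is why the measure suits comparing a local ranking model before and after reranking.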