Interpretable Models for Information Extraction

There is an abundance of information being generated constantly, most of it encoded as unstructured text. The information expressed this way, although publicly available, is not directly usable by computer systems because it is not organized according to a data model that could inform us how differe...

Bibliographic Details
Main Author: Valenzuela Escárcega, Marco Antonio
Other Authors: Surdeanu, Mihai
Language: en_US
Published: The University of Arizona. 2016
Subjects: Computer Science; information extraction
Online Access: http://hdl.handle.net/10150/613348
http://arizona.openrepository.com/arizona/handle/10150/613348
id ndltd-arizona.edu-oai-arizona.openrepository.com-10150-613348
record_format oai_dc
spelling ndltd-arizona.edu-oai-arizona.openrepository.com-10150-613348 2016-06-18T03:00:58Z Interpretable Models for Information Extraction Valenzuela Escárcega, Marco Antonio Surdeanu, Mihai Morrison, Clayton Efrat, Alon Demir, Emek Surdeanu, Mihai Computer Science information extraction 2016 text Electronic Dissertation http://hdl.handle.net/10150/613348 http://arizona.openrepository.com/arizona/handle/10150/613348 en_US Copyright © is held by the author. Digital access to this material is made possible by the University Libraries, University of Arizona. Further transmission, reproduction or presentation (such as public display or performance) of protected items is prohibited except with permission of the author. The University of Arizona.
collection NDLTD
language en_US
sources NDLTD
topic Computer Science
information extraction
description There is an abundance of information being generated constantly, most of it encoded as unstructured text. The information expressed this way, although publicly available, is not directly usable by computer systems because it is not organized according to a data model that could inform us how different data nuggets relate to each other. Information extraction provides a way of scanning unstructured text and extracting structured knowledge suitable for querying and manipulation. Most information extraction research focuses on machine learning approaches that can be considered black boxes when deployed in information extraction systems. We propose a declarative language designed for the information extraction task. It allows the use of syntactic patterns alongside token-based surface patterns that incorporate shallow linguistic features. It captures complex constructs such as nested structures, and complex regular expressions over syntactic patterns for event arguments. We implement a novel information extraction runtime system designed for the compilation and execution of the proposed language. The runtime system has novel features for better declarative support, while preserving practicality. It supports features required for handling natural language, like the preservation of ambiguity and the efficient use of contextual information. It has a modular architecture that allows it to be extended with new functionality, which, together with the language design, provides a powerful framework for the development and research of new ideas for declarative information extraction. We use our language and runtime system to build a biomedical information extraction system. This system is capable of recognizing biological entities (e.g., genes, proteins, protein families, simple chemicals), events over entities (e.g., biochemical reactions), and nested events that take other events as arguments (e.g., catalysis). Additionally, it captures complex natural language phenomena like coreference and hedging. Finally, we propose a rule learning procedure to extract rules from statistical systems trained for information extraction. Rule learning combines the advantages of machine learning with the interpretability of our models. This enables us to train information extraction systems using annotated data that can then be extended and modified by human experts, and in this way accelerate the deployment of new systems that can still be extended or modified by human experts.
author2 Surdeanu, Mihai
author_facet Surdeanu, Mihai
Valenzuela Escárcega, Marco Antonio
author Valenzuela Escárcega, Marco Antonio
author_sort Valenzuela Escárcega, Marco Antonio
title Interpretable Models for Information Extraction
publisher The University of Arizona.
publishDate 2016
url http://hdl.handle.net/10150/613348
http://arizona.openrepository.com/arizona/handle/10150/613348