Query Optimization for On-Demand Information Extraction Tasks over Text Databases
Many modern applications involve analyzing large amounts of data that comes from unstructured text documents. In its original format, data contains information that, if extracted, can give more insight and help in the decision-making process. The ability to answer structured SQL queries over unstruc...
Main Author: | Farid, Mina H. |
---|---|
Language: | en |
Published: | 2012 |
Subjects: | Database; Query Optimization; Information Extraction; Data Quality |
Online Access: | http://hdl.handle.net/10012/6593 |
id |
ndltd-LACETR-oai-collectionscanada.gc.ca-OWTU.10012-6593 |
---|---|
record_format |
oai_dc |
spelling |
ndltd-LACETR-oai-collectionscanada.gc.ca-OWTU.10012-6593; 2013-10-04T04:11:22Z; Farid, Mina H.; 2012-03-27T19:55:56Z; 2012-03-27T19:55:56Z; 2012-03-27T19:55:56Z; 2012-03-12; http://hdl.handle.net/10012/6593; en; Database; Query Optimization; Information Extraction; Data Quality; Query Optimization for On-Demand Information Extraction Tasks over Text Databases; Thesis or Dissertation; School of Computer Science; Master of Mathematics; Computer Science |
collection |
NDLTD |
language |
en |
sources |
NDLTD |
topic |
Database; Query Optimization; Information Extraction; Data Quality; Computer Science |
description |
Many modern applications involve analyzing large amounts of data that comes from unstructured text documents. In its original format, this data contains information that, if extracted, can give more insight and help in the decision-making process. The ability to answer structured SQL queries over unstructured data allows for more complex data analysis. Querying unstructured data can be accomplished with the help of information extraction (IE) techniques. The traditional way is to use the Extract-Transform-Load (ETL) approach, which performs all possible extractions over the document corpus and stores the extracted relational results in a data warehouse. The extracted data is then queried. The ETL approach produces results that are out of date and causes an explosion in the number of possible relations and attributes to extract. Therefore, new approaches that perform extraction on the fly were developed; however, previous efforts relied on specialized extraction operators or particular IE algorithms, which limited the optimization opportunities for such queries.
In this work, we propose an on-line approach that integrates the engine of the database management system with IE systems using a new type of view called extraction views. Queries on text documents are evaluated using these extraction views, which are populated at query time with newly extracted data. Our approach enables the optimizer to apply all well-defined optimization techniques. The optimizer selects the best execution plan using a defined cost model that considers a user-defined balance between the cost and quality of extraction, and we explain the trade-off between the two factors. The main contribution is the ability to run on-demand information extraction that considers the latest changes in the data, while avoiding unnecessary extraction from irrelevant text documents. |
author |
Farid, Mina H. |
author_facet |
Farid, Mina H. |
author_sort |
Farid, Mina H. |
title |
Query Optimization for On-Demand Information Extraction Tasks over Text Databases |
title_sort |
query optimization for on-demand information extraction tasks over text databases |
publishDate |
2012 |
url |
http://hdl.handle.net/10012/6593 |