A Green Form-Based Information Extraction System for Historical Documents

Many historical documents are rich in genealogical facts. Extracting these facts by hand is tedious and almost impossible considering the hundreds of thousands of genealogically rich family-history books currently scanned and online. As one approach for helping to make the extraction feasible, we...


Bibliographic Details
Main Author: Kim, Tae Woo
Format: Others
Published: BYU ScholarsArchive 2017
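Subjects: green systems; self-improving systems; data extraction; regular-expression generation; Computer Sciences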
Online Access: https://scholarsarchive.byu.edu/etd/6375
https://scholarsarchive.byu.edu/cgi/viewcontent.cgi?article=7375&context=etd
id ndltd-BGMYU2-oai-scholarsarchive.byu.edu-etd-7375
record_format oai_dc
spelling ndltd-BGMYU2-oai-scholarsarchive.byu.edu-etd-7375 2019-05-16T03:31:32Z A Green Form-Based Information Extraction System for Historical Documents Kim, Tae Woo 2017-05-01T07:00:00Z text application/pdf https://scholarsarchive.byu.edu/etd/6375 https://scholarsarchive.byu.edu/cgi/viewcontent.cgi?article=7375&context=etd http://lib.byu.edu/about/copyright/ All Theses and Dissertations BYU ScholarsArchive green systems self-improving systems data extraction regular-expression generation Computer Sciences
collection NDLTD
format Others
sources NDLTD
topic green systems
self-improving systems
data extraction
regular-expression generation
Computer Sciences
description Many historical documents are rich in genealogical facts. Extracting these facts by hand is tedious and almost impossible considering the hundreds of thousands of genealogically rich family-history books currently scanned and online. As one approach to making the extraction feasible, we propose GreenFIE, a "Green" Form-based Information-Extraction tool that is "green" in the sense that it improves with use, with the goal of minimizing the cost of human labor while maintaining high extraction accuracy. Given a page in a historical document, the user's task is to fill out the given forms with every fact on the page that the forms call for (e.g., birth and death information, marriage information, and parent-child relationships for each person on the page). GreenFIE has a repository of extraction patterns that it applies to fill in forms. A user checks the correctness of GreenFIE's form filling, adds any missed facts, and fixes any mistakes. GreenFIE learns from this user feedback, adding new extraction rules to its repository. Ideally, GreenFIE improves as it proceeds so that it does most of the work, leaving little for the user to do other than confirm that its extraction is correct. We evaluate how well GreenFIE performs on family-history books in terms of "greenness": how much human labor diminishes during form filling while high accuracy is maintained.
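The loop the description outlines (apply stored patterns, let the user add what was missed, turn each correction into a new rule) can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not taken from the thesis: the names Repository, generalize, and process_page are hypothetical, the sample pages are invented, the ground-truth records stand in for the user's feedback, and the regex generalization is deliberately naive. The count of hand-entered records per page is only a rough stand-in for the "greenness" the description mentions.

```python
import re
from dataclasses import dataclass, field


def generalize(line: str, record: dict[str, str]) -> re.Pattern:
    """Turn one user-corrected line into a regex with a named group per field.

    Toy assumption: each value occurs once in the line and values do not overlap.
    """
    spans = sorted((line.index(value), key, value) for key, value in record.items())
    parts, pos = [], 0
    for start, key, value in spans:
        parts.append(re.escape(line[pos:start]))          # keep literal context
        char_class = r"\d+" if value.isdigit() else r"[A-Z][\w. ]*?"
        parts.append(f"(?P<{key}>{char_class})")           # generalize the value
        pos = start + len(value)
    parts.append(re.escape(line[pos:]))
    return re.compile("".join(parts))


@dataclass
class Repository:
    """Hypothetical store of learned extraction rules."""
    rules: list = field(default_factory=list)

    def extract(self, page: str) -> list[dict]:
        """Apply every stored rule and return the form records it fills."""
        records = []
        for rule in self.rules:
            for match in rule.finditer(page):
                if match.groupdict() not in records:
                    records.append(match.groupdict())
        return records

    def learn(self, page: str, record: dict[str, str]) -> None:
        """Add a rule generalized from the first line containing all values."""
        for line in page.splitlines():
            if all(value in line for value in record.values()):
                self.rules.append(generalize(line, record))
                return


def process_page(repo: Repository, page: str, truth: list[dict]) -> int:
    """Fill the form, learn from what was missed, return the manual-entry count."""
    found = repo.extract(page)
    missed = [rec for rec in truth if rec not in found]
    for rec in missed:
        repo.learn(page, rec)
    return len(missed)


# Invented sample data; "truths" simulates what a user would confirm or add.
pages = [
    "John Smith, b. 1802, d. 1870.",
    "Mary Jones, b. 1810, d. 1885.\nAnn Lee, b. 1812, d. 1890.",
]
truths = [
    [{"name": "John Smith", "birth": "1802", "death": "1870"}],
    [{"name": "Mary Jones", "birth": "1810", "death": "1885"},
     {"name": "Ann Lee", "birth": "1812", "death": "1890"}],
]

repo = Repository()
effort = [process_page(repo, page, truth) for page, truth in zip(pages, truths)]
print(effort)  # [1, 0]
```

In this sketch the first page costs one manual entry and the second costs none, because the rule learned from the first correction already covers both records on the second page. A real system would need far more careful generalization and would also have to handle incorrect extractions, which this sketch ignores.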
author Kim, Tae Woo
title A Green Form-Based Information Extraction System for Historical Documents
publisher BYU ScholarsArchive
publishDate 2017
url https://scholarsarchive.byu.edu/etd/6375
https://scholarsarchive.byu.edu/cgi/viewcontent.cgi?article=7375&context=etd