Spelling Normalisation and Linguistic Analysis of Historical Text for Information Extraction

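Historical text constitutes a rich source of information for historians and other researchers in the humanities. Many texts are, however, not available in an electronic format, and even when they are, there is a lack of NLP tools designed to handle historical text. In my thesis, I aim to provide a generic workflow for automatic linguistic analysis and information extraction from historical text, with spelling normalisation as a core component in the pipeline. In the spelling normalisation step, the historical input text is automatically normalised to a more modern spelling, enabling the use of existing taggers and parsers trained on modern language data in the succeeding linguistic analysis step. In the final information extraction step, certain linguistic structures are identified based on the annotation labels given by the NLP tools, and ranked in accordance with the specific information need expressed by the user. An important consideration in my implementation is that the pipeline should be applicable to different languages, time periods, genres, and information needs by simply substituting the language resources used in each module. Furthermore, the reuse of existing NLP tools developed for the modern language is crucial: the lack of linguistically annotated historical data, combined with the high variability in historical text, makes it hard to train NLP tools specifically aimed at analysing historical text.

In my evaluation, I show that spelling normalisation can be a very useful technique for easy access to historical information content, even in cases where there is little (or no) annotated historical training data available. For the specific information extraction task of automatically identifying verb phrases describing work in Early Modern Swedish text, 91 out of the 100 top-ranked instances are true positives in the best setting.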

Bibliographic Details
Main Author:	Pettersson, Eva
Format:	Doctoral Thesis
Language:	English
Published:	Uppsala universitet, Institutionen för lingvistik och filologi 2016
Subjects:	NLP for historical text; spelling normalisation; digital humanities; information extraction; character-based statistical machine translation; SMT; Levenshtein edit distance; language technology; computational linguistics
Online Access:	http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-269753 http://nbn-resolving.de/urn:isbn:978-91-554-9443-8

id	ndltd-UPSALLA1-oai-DiVA.org-uu-269753
record_format	oai_dc
spelling	ndltd-UPSALLA1-oai-DiVA.org-uu-269753; 2016-02-20T05:17:12Z; eng; Uppsala; 2016; Doctoral thesis, monograph; info:eu-repo/semantics/doctoralThesis; text; http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-269753; urn:isbn:978-91-554-9443-8; Studia Linguistica Upsaliensia, 1652-1366 ; 17; application/pdf; info:eu-repo/semantics/openAccess
collection	NDLTD
language	English
format	Doctoral Thesis
sources	NDLTD
topic	NLP for historical text; spelling normalisation; digital humanities; information extraction; character-based statistical machine translation; SMT; Levenshtein edit distance; language technology; computational linguistics
description	Historical text constitutes a rich source of information for historians and other researchers in the humanities. Many texts are, however, not available in an electronic format, and even when they are, there is a lack of NLP tools designed to handle historical text. In my thesis, I aim to provide a generic workflow for automatic linguistic analysis and information extraction from historical text, with spelling normalisation as a core component in the pipeline. In the spelling normalisation step, the historical input text is automatically normalised to a more modern spelling, enabling the use of existing taggers and parsers trained on modern language data in the succeeding linguistic analysis step. In the final information extraction step, certain linguistic structures are identified based on the annotation labels given by the NLP tools, and ranked in accordance with the specific information need expressed by the user. An important consideration in my implementation is that the pipeline should be applicable to different languages, time periods, genres, and information needs by simply substituting the language resources used in each module. Furthermore, the reuse of existing NLP tools developed for the modern language is crucial: the lack of linguistically annotated historical data, combined with the high variability in historical text, makes it hard to train NLP tools specifically aimed at analysing historical text. In my evaluation, I show that spelling normalisation can be a very useful technique for easy access to historical information content, even in cases where there is little (or no) annotated historical training data available. For the specific information extraction task of automatically identifying verb phrases describing work in Early Modern Swedish text, 91 out of the 100 top-ranked instances are true positives in the best setting.
author	Pettersson, Eva
author_facet	Pettersson, Eva
author_sort	Pettersson, Eva
title	Spelling Normalisation and Linguistic Analysis of Historical Text for Information Extraction
title_sort	spelling normalisation and linguistic analysis of historical text for information extraction
publisher	Uppsala universitet, Institutionen för lingvistik och filologi
publishDate	2016
url	http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-269753 http://nbn-resolving.de/urn:isbn:978-91-554-9443-8
