Integrated Sequence Tagging for Medieval Latin Using Deep Representation Learning

In this paper we consider two sequence tagging tasks for medieval Latin: part-of-speech tagging and lemmatization. These are basic yet foundational preprocessing steps in applications such as text re-use detection. Nevertheless, they are generally complicated by the considerable orthographic variation that is typical of medieval Latin. In Digital Classics, these tasks are traditionally solved in a (i) cascaded and (ii) lexicon-dependent fashion: a lexicon is used to generate all the potential lemma-tag pairs for a token, and a context-aware PoS-tagger is then used to select the most appropriate tag-lemma pair. Apart from the problems with out-of-lexicon items, error percolation is a major downside of such approaches. In this paper we explore the possibility of solving these tasks elegantly using a single, integrated approach. For this, we make use of a layered neural network architecture from the field of deep representation learning.


Bibliographic Details
Main Authors: Mike Kestemont, Jeroen De Gussem
Format: Article
Language: English
Published: Nicolas Turenne 2017-08-01
Series: Journal of Data Mining and Digital Humanities
Subjects:
Online Access: https://jdmdh.episciences.org/3835/pdf
id doaj-82de497163ea46498b39c413059c0c30
record_format Article
spelling doaj-82de497163ea46498b39c413059c0c30 | 2021-02-22T16:19:10Z | eng | Nicolas Turenne | Journal of Data Mining and Digital Humanities | 2416-5999 | 2017-08-01 | Special Issue on Computer-Aided Processing of Intertextuality in Ancient Languages: Towards a Digital Ecosystem: NLP, Corpus Infrastructure, Methods for Retrieving Texts and Computing Text Similarities | jdmdh:3835 | Integrated Sequence Tagging for Medieval Latin Using Deep Representation Learning | Mike Kestemont; Jeroen De Gussem | In this paper we consider two sequence tagging tasks for medieval Latin: part-of-speech tagging and lemmatization. These are basic yet foundational preprocessing steps in applications such as text re-use detection. Nevertheless, they are generally complicated by the considerable orthographic variation that is typical of medieval Latin. In Digital Classics, these tasks are traditionally solved in a (i) cascaded and (ii) lexicon-dependent fashion: a lexicon is used to generate all the potential lemma-tag pairs for a token, and a context-aware PoS-tagger is then used to select the most appropriate tag-lemma pair. Apart from the problems with out-of-lexicon items, error percolation is a major downside of such approaches. In this paper we explore the possibility of solving these tasks elegantly using a single, integrated approach. For this, we make use of a layered neural network architecture from the field of deep representation learning. | https://jdmdh.episciences.org/3835/pdf | computer science - computation and language; computer science - learning; statistics - machine learning
collection DOAJ
language English
format Article
sources DOAJ
author Mike Kestemont
Jeroen De Gussem
spellingShingle Mike Kestemont
Jeroen De Gussem
Integrated Sequence Tagging for Medieval Latin Using Deep Representation Learning
Journal of Data Mining and Digital Humanities
computer science - computation and language
computer science - learning
statistics - machine learning
author_facet Mike Kestemont
Jeroen De Gussem
author_sort Mike Kestemont
title Integrated Sequence Tagging for Medieval Latin Using Deep Representation Learning
title_short Integrated Sequence Tagging for Medieval Latin Using Deep Representation Learning
title_full Integrated Sequence Tagging for Medieval Latin Using Deep Representation Learning
title_fullStr Integrated Sequence Tagging for Medieval Latin Using Deep Representation Learning
title_full_unstemmed Integrated Sequence Tagging for Medieval Latin Using Deep Representation Learning
title_sort integrated sequence tagging for medieval latin using deep representation learning
publisher Nicolas Turenne
series Journal of Data Mining and Digital Humanities
issn 2416-5999
publishDate 2017-08-01
description In this paper we consider two sequence tagging tasks for medieval Latin: part-of-speech tagging and lemmatization. These are basic yet foundational preprocessing steps in applications such as text re-use detection. Nevertheless, they are generally complicated by the considerable orthographic variation that is typical of medieval Latin. In Digital Classics, these tasks are traditionally solved in a (i) cascaded and (ii) lexicon-dependent fashion: a lexicon is used to generate all the potential lemma-tag pairs for a token, and a context-aware PoS-tagger is then used to select the most appropriate tag-lemma pair. Apart from the problems with out-of-lexicon items, error percolation is a major downside of such approaches. In this paper we explore the possibility of solving these tasks elegantly using a single, integrated approach. For this, we make use of a layered neural network architecture from the field of deep representation learning.
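The cascaded, lexicon-dependent pipeline described in the abstract (lexicon proposes candidate lemma-tag pairs, a context-aware tagger picks one) can be sketched as follows. This is a minimal illustration only: the toy lexicon and the trivial scoring heuristic are stand-ins, not the authors' tools or data.

```python
# Sketch of a cascaded, lexicon-dependent tagger-lemmatizer.
# LEXICON and score() are hypothetical stand-ins for illustration.

# Toy lexicon: token -> all candidate (lemma, PoS-tag) pairs.
LEXICON = {
    "puella": [("puella", "NOUN")],
    "canem": [("canis", "NOUN")],
    "amat": [("amo", "VERB")],
}

def candidates(token):
    """Step 1: the lexicon generates potential lemma-tag pairs.
    Out-of-lexicon tokens get no real candidates -- a known weakness."""
    return LEXICON.get(token, [("<unk>", "X")])

def score(pair, context):
    """Step 2: a context-aware model scores each candidate.
    Here a trivial heuristic stands in for a trained PoS-tagger;
    errors made here percolate into the final analysis."""
    lemma, tag = pair
    return 1.0 if tag != "X" else 0.0

def tag_sentence(tokens):
    """Run the cascade: lookup, then contextual disambiguation."""
    tagged = []
    for i, token in enumerate(tokens):
        context = tokens[max(0, i - 2):i + 3]  # small context window
        best = max(candidates(token), key=lambda p: score(p, context))
        tagged.append((token,) + best)
    return tagged

print(tag_sentence(["puella", "canem", "amat"]))
# -> [('puella', 'puella', 'NOUN'), ('canem', 'canis', 'NOUN'), ('amat', 'amo', 'VERB')]
```

The integrated approach the paper argues for replaces this two-stage lookup-then-disambiguate cascade with a single neural model, avoiding both the lexicon dependency and the error percolation between stages.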
topic computer science - computation and language
computer science - learning
statistics - machine learning
url https://jdmdh.episciences.org/3835/pdf
work_keys_str_mv AT mikekestemont integratedsequencetaggingformedievallatinusingdeeprepresentationlearning
AT jeroendegussem integratedsequencetaggingformedievallatinusingdeeprepresentationlearning
_version_ 1724256521926213632