Integrated Sequence Tagging for Medieval Latin Using Deep Representation Learning
In this paper we consider two sequence tagging tasks for medieval Latin: part-of-speech tagging and lemmatization. These are basic yet foundational preprocessing steps in applications such as text re-use detection. Nevertheless, they are generally complicated by the considerable orthographic variation that is typical of medieval Latin. In Digital Classics, these tasks are traditionally solved in a (i) cascaded and (ii) lexicon-dependent fashion: a lexicon is used to generate all potential lemma-tag pairs for a token, and a context-aware PoS tagger then selects the most appropriate pair. Apart from the problems with out-of-lexicon items, error percolation is a major downside of such approaches. In this paper we explore the possibility of solving both tasks elegantly with a single, integrated approach, making use of a layered neural network architecture from the field of deep representation learning.
Main Authors: | Mike Kestemont, Jeroen De Gussem |
---|---|
Format: | Article |
Language: | English |
Published: | Nicolas Turenne, 2017-08-01 |
Series: | Journal of Data Mining and Digital Humanities (Special Issue on Computer-Aided Processing of Intertextuality in Ancient Languages) |
ISSN: | 2416-5999 |
Subjects: | computer science - computation and language; computer science - learning; statistics - machine learning |
Online Access: | https://jdmdh.episciences.org/3835/pdf |
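
To make the contrast with the cascaded, lexicon-dependent pipeline concrete, the sketch below shows one way such an integrated, multi-task tagger-lemmatizer could look. It is a minimal illustration under stated assumptions, not the architecture from the paper: it assumes a shared bidirectional LSTM encoder with two classification heads and treats lemmatization as classification over a closed lemma vocabulary (the paper itself may generate lemmas differently).

```python
# Hypothetical sketch of an *integrated* tagger-lemmatizer: one shared encoder,
# two prediction heads, trained jointly, so no lexicon lookup and no cascade.
# Hyperparameters and the lemma-as-class simplification are illustrative
# assumptions, not taken from the paper.
import torch
import torch.nn as nn

class JointTaggerLemmatizer(nn.Module):
    def __init__(self, vocab_size, n_pos_tags, n_lemmas,
                 emb_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        # Shared bidirectional LSTM encoder over the token sequence.
        self.encoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True,
                               bidirectional=True)
        # Task-specific output layers ("heads") on top of the shared states.
        self.pos_head = nn.Linear(2 * hidden_dim, n_pos_tags)
        self.lemma_head = nn.Linear(2 * hidden_dim, n_lemmas)

    def forward(self, token_ids):
        states, _ = self.encoder(self.embed(token_ids))
        return self.pos_head(states), self.lemma_head(states)

# Joint training: the two cross-entropy losses are simply summed, so an error
# in one task cannot percolate into the other as it would in a cascade.
model = JointTaggerLemmatizer(vocab_size=5000, n_pos_tags=20, n_lemmas=3000)
tokens = torch.randint(1, 5000, (8, 25))        # batch of 8 sentences, 25 tokens each
pos_gold = torch.randint(0, 20, (8, 25))
lemma_gold = torch.randint(0, 3000, (8, 25))

pos_logits, lemma_logits = model(tokens)
loss = (nn.functional.cross_entropy(pos_logits.reshape(-1, 20), pos_gold.reshape(-1))
        + nn.functional.cross_entropy(lemma_logits.reshape(-1, 3000), lemma_gold.reshape(-1)))
loss.backward()
```

Because both heads read the same contextual states, orthographic variants that the encoder learns to normalize benefit tagging and lemmatization at once, which is the motivation the abstract gives for preferring a single, integrated model over a pipeline.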