Evaluating Deep Learning Methods for Word Segmentation of Scripta Continua Texts in Old French and Latin Évaluer les méthodes de deep learning pour la segmentation des mots de textes en scripta continua en ancien francais et en latin
Tokenization of modern and old Western European languages seems fairly simple, as it relies mostly on markers such as spaces and punctuation. However, when dealing with old sources such as manuscripts written in scripta continua, ancient epigraphy or Middl...
Main Author: | Thibault Clérice |
---|---|
Format: | Article |
Language: | English |
Published: | Nicolas Turenne 2020-04-01 |
Series: | Journal of Data Mining and Digital Humanities |
Subjects: | |
Online Access: | https://jdmdh.episciences.org/6264/pdf |
id | doaj-4ea435ffcacc4ae0b656d4b92849cdef |
---|---|
record_format | Article |
spelling | doaj-4ea435ffcacc4ae0b656d4b92849cdef; 2021-02-22T16:19:10Z; eng; Nicolas Turenne; Journal of Data Mining and Digital Humanities; 2416-5999; 2020-04-01; 2020; Towards a Digital Ecosystem: NLP. Corpus infrastructure. Methods for Retrieving Texts and Computing Text Similarities; jdmdh:6264; Thibault Clérice |
collection | DOAJ |
language | English |
format | Article |
sources | DOAJ |
author | Thibault Clérice |
title | Evaluating Deep Learning Methods for Word Segmentation of Scripta Continua Texts in Old French and Latin Évaluer les méthodes de deep learning pour la segmentation des mots de textes en scripta continua en ancien francais et en latin |
publisher | Nicolas Turenne |
series | Journal of Data Mining and Digital Humanities |
issn | 2416-5999 |
publishDate | 2020-04-01 |
description | Tokenization of modern and old Western European languages seems fairly simple, as it relies mostly on markers such as spaces and punctuation. However, when dealing with old sources such as manuscripts written in scripta continua, ancient epigraphy or Middle Ages manuscripts, (1) such markers are mostly absent and (2) spelling variation and rich morphology make dictionary-based approaches difficult. Applying convolutional encoding to characters, followed by linear classification into word-boundary or in-word-sequence, is shown to be effective at tokenizing such inputs. Additionally, the software is released with a simple interface for tokenizing a corpus or generating a training set. |
topic | convolutional network; scripta continua; tokenization; old french; word segmentation; [shs.langue]humanities and social sciences/linguistics; [shs.class]humanities and social sciences/classical studies; [info]computer science [cs] |
url | https://jdmdh.episciences.org/6264/pdf |
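
The description above frames word segmentation as labelling each character as either a word boundary or in-word. The following is a minimal sketch of that labelling scheme only, assuming one label per character; it is not the released software's interface, and the names `to_training_pair`, `apply_labels`, `BOUNDARY` and `IN_WORD` are hypothetical. The convolutional encoder and linear classifier are omitted: in a real system they would predict the labels that are taken here from already segmented text.

```python
from typing import List, Tuple

# Hypothetical label values, chosen for illustration only.
BOUNDARY = 1  # the character is the last one of a word
IN_WORD = 0   # the character is followed by more characters of the same word


def to_training_pair(segmented: str) -> Tuple[str, List[int]]:
    """Strip the spaces from a segmented line and record where they were."""
    chars: List[str] = []
    labels: List[int] = []
    for token in segmented.split():
        for i, ch in enumerate(token):
            chars.append(ch)
            labels.append(BOUNDARY if i == len(token) - 1 else IN_WORD)
    return "".join(chars), labels


def apply_labels(chars: str, labels: List[int]) -> str:
    """Re-insert a space after every character labelled as a boundary."""
    if len(chars) != len(labels):
        raise ValueError("one label per character is required")
    out: List[str] = []
    for ch, label in zip(chars, labels):
        out.append(ch)
        if label == BOUNDARY:
            out.append(" ")
    return "".join(out).strip()


# Build one training example from a segmented Latin line, then decode it back.
continuous, gold = to_training_pair("arma virumque cano")
print(continuous)                      # armavirumquecano
print(apply_labels(continuous, gold))  # arma virumque cano
```

Generating a training set then amounts to running `to_training_pair` over a corpus that already has spaces, and tokenizing amounts to running `apply_labels` on the labels a trained model predicts for an unsegmented line.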