Evaluating Deep Learning Methods for Word Segmentation of Scripta Continua Texts in Old French and Latin / Évaluer les méthodes de deep learning pour la segmentation des mots de textes en scripta continua en ancien français et en latin

Tokenization of modern and old Western European languages seems fairly simple, as it relies mostly on the presence of markers such as spaces and punctuation. However, when dealing with old sources such as manuscripts written in scripta continua, ancient epigraphy or Middl...


Bibliographic Details
Main Author: Thibault Clérice
Format: Article
Language: English
Published: Nicolas Turenne 2020-04-01
Series: Journal of Data Mining and Digital Humanities
Online Access: https://jdmdh.episciences.org/6264/pdf
id doaj-4ea435ffcacc4ae0b656d4b92849cdef
record_format Article
spelling doaj-4ea435ffcacc4ae0b656d4b92849cdef | 2021-02-22T16:19:10Z | eng | Nicolas Turenne | Journal of Data Mining and Digital Humanities | 2416-5999 | 2020-04-01 | 2020 | Towards a Digital Ecosystem: NLP. Corpus infrastructure. Methods for Retrieving Texts and Computing Text Similarities | jdmdh:6264
collection DOAJ
language English
format Article
sources DOAJ
author Thibault Clérice
title Evaluating Deep Learning Methods for Word Segmentation of Scripta Continua Texts in Old French and Latin / Évaluer les méthodes de deep learning pour la segmentation des mots de textes en scripta continua en ancien français et en latin
publisher Nicolas Turenne
series Journal of Data Mining and Digital Humanities
issn 2416-5999
publishDate 2020-04-01
description Tokenization of modern and old Western European languages seems fairly simple, as it relies mostly on the presence of markers such as spaces and punctuation. However, when dealing with old sources such as manuscripts written in scripta continua, ancient epigraphy or medieval manuscripts, (1) such markers are mostly absent and (2) spelling variation and rich morphology make dictionary-based approaches difficult. Applying convolutional encoding to characters, followed by linear classification into word-boundary or in-word-sequence, is shown to be effective at tokenizing such inputs. Additionally, the software is released with a simple interface for tokenizing a corpus or generating a training set.
topic convolutional network
scripta continua
tokenization
old french
word segmentation
[shs.langue]humanities and social sciences/linguistics
[shs.class]humanities and social sciences/classical studies
[info]computer science [cs]
url https://jdmdh.episciences.org/6264/pdf
_version_ 1724256493187891200
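
The description above frames word segmentation as labelling each character as either a word boundary or part of a word in progress. The following is a minimal sketch of that label encoding only, not of the convolutional model and not of the released software's interface. The function names (`to_training_pair`, `apply_labels`) and the convention of marking the last character of each word as the boundary are assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical label scheme: the record does not say which character of a
# word carries the boundary label, so marking the last one is an assumption.
BOUNDARY = 1  # last character of a word
IN_WORD = 0   # character followed by more characters of the same word


def to_training_pair(segmented: str) -> Tuple[str, List[int]]:
    """Strip whitespace from a segmented line and label every character."""
    chars: List[str] = []
    labels: List[int] = []
    for word in segmented.split():
        for i, ch in enumerate(word):
            chars.append(ch)
            labels.append(BOUNDARY if i == len(word) - 1 else IN_WORD)
    return "".join(chars), labels


def apply_labels(continua: str, labels: List[int]) -> str:
    """Re-insert a space after each character labelled as a boundary."""
    if len(continua) != len(labels):
        raise ValueError("one label per character is required")
    out: List[str] = []
    for ch, label in zip(continua, labels):
        out.append(ch)
        if label == BOUNDARY:
            out.append(" ")
    return "".join(out).strip()


text, labels = to_training_pair("arma virumque cano")
print(text)                        # armavirumquecano
print(apply_labels(text, labels))  # arma virumque cano
```

In a setup like the one described, a trained classifier would predict the label sequence from the unsegmented string, and a decoding step such as `apply_labels` would turn the predictions back into tokens.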