Semantic Indexing of 19th-Century Greek Literature Using 21st-Century Linguistic Resources


Bibliographic Details
Main Authors: Dimitris Dimitriadis, Sofia Zapounidou, Grigorios Tsoumakas
Format: Article
Language: English
Published: MDPI AG 2021-08-01
Series: Sustainability
Online Access: https://www.mdpi.com/2071-1050/13/16/8878
Description
Summary: Manually classifying works of literature with genre/form concepts is a time-consuming task that requires domain expertise. Automated systems based on language understanding can help humans do this work faster and more consistently. To this end, we present a case study on the automatic classification of 19th-century Greek literature books. The main challenges are the limited number of literature books and resources from that era and the quality of the source text. We propose an automated classification system based on the Bidirectional Encoder Representations from Transformers (BERT) model, trained on books from the 20th and 21st centuries. We also address BERT's constraint on the maximum input sequence length by using the TextRank algorithm to construct representative sentences or phrases from each book. The results show that BERT trained on recent literature books correctly classifies most of the 19th-century books despite the disparity between the two collections. Additionally, the TextRank algorithm improves the performance of BERT.
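The summary describes using TextRank to shorten each book to representative sentences that fit within BERT's input limit. The record does not give the authors' implementation, so the following is only a minimal illustrative sketch of that general idea, written with the Python standard library. The function names (split_sentences, textrank, select_for_encoder), the naive sentence splitter, the word-overlap similarity, and the word-count budget (a rough stand-in for a subword token limit such as the 512 tokens of standard BERT checkpoints) are all assumptions made for illustration.

```python
import math
import re


def split_sentences(text):
    """Naive splitter (assumption): break after ., !, ? or ; (the Greek question mark)."""
    parts = re.split(r"(?<=[.!?;])\s+", text.strip())
    return [p for p in parts if p]


def similarity(tokens_a, tokens_b):
    """Word overlap normalised by log sentence sizes, in the style of TextRank."""
    wa, wb = set(tokens_a), set(tokens_b)
    if len(wa) < 2 or len(wb) < 2:
        return 0.0  # avoids log(1) = 0 in the denominator
    return len(wa & wb) / (math.log(len(wa)) + math.log(len(wb)))


def textrank(sentences, damping=0.85, iterations=50):
    """Score sentences with a weighted PageRank over the similarity graph."""
    tokens = [re.findall(r"\w+", s.lower()) for s in sentences]
    n = len(sentences)
    weights = [
        [similarity(tokens[i], tokens[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        updated = []
        for i in range(n):
            # Each neighbour j passes on a share of its score proportional
            # to the weight of the edge j -> i.
            rank = sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n)
                if out_sum[j] > 0
            )
            updated.append((1 - damping) + damping * rank)
        scores = updated
    return scores


def select_for_encoder(text, max_words=400):
    """Keep the highest-ranked sentences that fit the budget, in original order.

    max_words is a hypothetical proxy for the model's subword token limit.
    """
    sentences = split_sentences(text)
    scores = textrank(sentences)
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen, used = [], 0
    for i in ranked:
        length = len(sentences[i].split())
        if used + length > max_words:
            continue  # skip sentences that would exceed the budget
        chosen.append(i)
        used += length
    return " ".join(sentences[i] for i in sorted(chosen))
```

In a pipeline like the one summarised above, the string returned by select_for_encoder would be what is passed to the classifier in place of the full book text. A real system would count tokens with the model's own tokenizer rather than whitespace-separated words.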
ISSN: 2071-1050