Models for predicting the inflectional paradigm of Croatian words
Morphological analysis is a prerequisite for many natural language processing tasks. For inflectionally rich languages such as Croatian, morphological analysis typically relies on a morphological lexicon, which lists the lemmas and their paradigms. However, a real-life morphological analyzer must al...
Main Author: | Jan Šnajder |
---|---|
Format: | Article |
Language: | English |
Published: | Znanstvena založba Filozofske fakultete Univerze v Ljubljani (Ljubljana University Press, Faculty of Arts), 2013-12-01 |
Series: | Slovenščina 2.0: Empirične, aplikativne in interdisciplinarne raziskave |
Subjects: | computational morphology; paradigm prediction; machine learning; feature selection |
Online Access: | http://www.trojina.org/slovenscina2.0/arhiv/2013/2/Slo2.0_2013_2_02.pdf |
id |
doaj-551d6bbf10e44d5c994ee309ea9a31c5 |
---|---|
record_format |
Article |
spelling |
doaj-551d6bbf10e44d5c994ee309ea9a31c5; 2021-04-02T07:02:16Z; eng; Znanstvena založba Filozofske fakultete Univerze v Ljubljani (Ljubljana University Press, Faculty of Arts); Slovenščina 2.0: Empirične, aplikativne in interdisciplinarne raziskave; 2335-2736; 2013-12-01; 12134; Models for predicting the inflectional paradigm of Croatian words; Jan Šnajder; 0; Faculty of Electrical Engineering and Computing, Zagreb; Morphological analysis is a prerequisite for many natural language processing tasks. For inflectionally rich languages such as Croatian, morphological analysis typically relies on a morphological lexicon, which lists the lemmas and their paradigms. However, a real-life morphological analyzer must also be able to properly handle out-of-vocabulary words. We address the task of predicting the correct inflectional paradigm of unknown Croatian words. We frame this as a supervised machine learning problem: we train a classifier to predict whether a candidate lemma-paradigm pair is correct based on a number of string- and corpus-based features. The candidate lemma-paradigm pairs are generated using a handcrafted morphology grammar. Our aim is to examine the machine learning aspect of the problem: we test a comprehensive set of features and evaluate the classification accuracy using different feature subsets. We show that satisfactory classification accuracy (92%) can be achieved with an SVM using a combination of string- and corpus-based features. On a per-word basis, the F1-score is 53% and accuracy is 70%, which outperforms a frequency-based baseline by a wide margin. We discuss a number of possible directions for future research.; http://www.trojina.org/slovenscina2.0/arhiv/2013/2/Slo2.0_2013_2_02.pdf; computational morphology; paradigm prediction; machine learning; feature selection |
collection |
DOAJ |
language |
English |
format |
Article |
sources |
DOAJ |
author |
Jan Šnajder |
title |
Models for predicting the inflectional paradigm of Croatian words |
publisher |
Znanstvena založba Filozofske fakultete Univerze v Ljubljani (Ljubljana University Press, Faculty of Arts) |
series |
Slovenščina 2.0: Empirične, aplikativne in interdisciplinarne raziskave |
issn |
2335-2736 |
publishDate |
2013-12-01 |
description |
Morphological analysis is a prerequisite for many natural language processing tasks. For inflectionally rich languages such as Croatian, morphological analysis typically relies on a morphological lexicon, which lists the lemmas and their paradigms. However, a real-life morphological analyzer must also be able to properly handle out-of-vocabulary words. We address the task of predicting the correct inflectional paradigm of unknown Croatian words. We frame this as a supervised machine learning problem: we train a classifier to predict whether a candidate lemma-paradigm pair is correct based on a number of string- and corpus-based features. The candidate lemma-paradigm pairs are generated using a handcrafted morphology grammar. Our aim is to examine the machine learning aspect of the problem: we test a comprehensive set of features and evaluate the classification accuracy using different feature subsets. We show that satisfactory classification accuracy (92%) can be achieved with an SVM using a combination of string- and corpus-based features. On a per-word basis, the F1-score is 53% and accuracy is 70%, which outperforms a frequency-based baseline by a wide margin. We discuss a number of possible directions for future research. |
topic |
computational morphology; paradigm prediction; machine learning; feature selection |
url |
http://www.trojina.org/slovenscina2.0/arhiv/2013/2/Slo2.0_2013_2_02.pdf |
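The abstract describes generating candidate lemma-paradigm pairs for an unknown word and scoring them with string- and corpus-based features. The following is a minimal sketch of that general idea only, not the article's grammar, feature set, or SVM classifier. The two singular-only paradigms, the single `attested_ratio` feature, the function names, and the toy corpus counts are all hypothetical and chosen purely for illustration.

```python
from typing import NamedTuple


class Paradigm(NamedTuple):
    name: str
    lemma_suffix: str      # suffix that the lemma carries
    form_suffixes: tuple   # suffixes of the inflected forms


# Hypothetical, heavily simplified singular-only paradigms (an assumption,
# not the handcrafted morphology grammar mentioned in the abstract).
PARADIGMS = (
    Paradigm("masc-zero", "", ("", "a", "u", "e", "om")),
    Paradigm("fem-a", "a", ("a", "e", "i", "u", "o", "om")),
)


def generate_candidates(wordform):
    """Return every (lemma, paradigm) pair that could yield the wordform."""
    candidates = set()
    for paradigm in PARADIGMS:
        for suffix in paradigm.form_suffixes:
            if wordform.endswith(suffix) and len(wordform) > len(suffix):
                stem = wordform[:len(wordform) - len(suffix)]
                candidates.add((stem + paradigm.lemma_suffix, paradigm))
    return sorted(candidates, key=lambda c: (c[0], c[1].name))


def inflect(lemma, paradigm):
    """Generate all forms of a lemma under the given paradigm."""
    stem = lemma[:len(lemma) - len(paradigm.lemma_suffix)]
    return {stem + suffix for suffix in paradigm.form_suffixes}


def attested_ratio(lemma, paradigm, corpus_counts):
    """Corpus-based feature: share of generated forms seen in the corpus."""
    forms = inflect(lemma, paradigm)
    seen = sum(1 for form in forms if corpus_counts.get(form, 0) > 0)
    return seen / len(forms)


def rank_candidates(wordform, corpus_counts):
    """Score each candidate pair and sort from best to worst."""
    scored = [
        (attested_ratio(lemma, paradigm, corpus_counts), lemma, paradigm.name)
        for lemma, paradigm in generate_candidates(wordform)
    ]
    return sorted(scored, key=lambda t: (-t[0], t[1], t[2]))


# Toy corpus counts, invented for this example.
corpus_counts = {"voda": 5, "vode": 2, "vodi": 2, "vodu": 3, "vodom": 1}
ranked = rank_candidates("vodu", corpus_counts)
for score, lemma, paradigm_name in ranked:
    print(f"{lemma:6s} {paradigm_name:10s} {score:.3f}")
```

In this toy run the pair `voda` / `fem-a` ranks first, but only narrowly ahead of `vod` / `masc-zero`, since the two paradigms share most of their suffixes. That narrow margin suggests why a single corpus feature may be insufficient and why one might feed several such features to a trained classifier, as the abstract describes.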