Models for predicting the inflectional paradigm of Croatian words

Morphological analysis is a prerequisite for many natural language processing tasks. For inflectionally rich languages such as Croatian, morphological analysis typically relies on a morphological lexicon, which lists the lemmas and their paradigms. However, a real-life morphological analyzer must al...

Bibliographic Details
Main Author: Jan Šnajder
Format: Article
Language: English
Published: Znanstvena založba Filozofske fakultete Univerze v Ljubljani (Ljubljana University Press, Faculty of Arts) 2013-12-01
Series: Slovenščina 2.0: Empirične, aplikativne in interdisciplinarne raziskave
Subjects: computational morphology; paradigm prediction; machine learning; feature selection
Online Access: http://www.trojina.org/slovenscina2.0/arhiv/2013/2/Slo2.0_2013_2_02.pdf
id doaj-551d6bbf10e44d5c994ee309ea9a31c5
record_format Article
spelling doaj-551d6bbf10e44d5c994ee309ea9a31c5 | 2021-04-02T07:02:16Z | eng | Znanstvena založba Filozofske fakultete Univerze v Ljubljani (Ljubljana University Press, Faculty of Arts) | Slovenščina 2.0: Empirične, aplikativne in interdisciplinarne raziskave | 2335-2736 | 2013-12-01 | 12134 | Models for predicting the inflectional paradigm of Croatian words | Jan Šnajder, Faculty of Electrical Engineering and Computing, Zagreb | http://www.trojina.org/slovenscina2.0/arhiv/2013/2/Slo2.0_2013_2_02.pdf | computational morphology; paradigm prediction; machine learning; feature selection
collection DOAJ
language English
format Article
sources DOAJ
author Jan Šnajder
title Models for predicting the inflectional paradigm of Croatian words
publisher Znanstvena založba Filozofske fakultete Univerze v Ljubljani (Ljubljana University Press, Faculty of Arts)
series Slovenščina 2.0: Empirične, aplikativne in interdisciplinarne raziskave
issn 2335-2736
publishDate 2013-12-01
description Morphological analysis is a prerequisite for many natural language processing tasks. For inflectionally rich languages such as Croatian, morphological analysis typically relies on a morphological lexicon, which lists the lemmas and their paradigms. However, a real-life morphological analyzer must also be able to properly handle out-of-vocabulary words. We address the task of predicting the correct inflectional paradigm of unknown Croatian words. We frame this as a supervised machine learning problem: we train a classifier to predict whether a candidate lemma-paradigm pair is correct based on a number of string- and corpus-based features. The candidate lemma-paradigm pairs are generated using a handcrafted morphology grammar. Our aim is to examine the machine learning aspect of the problem: we test a comprehensive set of features and evaluate the classification accuracy using different feature subsets. We show that satisfactory classification accuracy (92%) can be achieved with an SVM using a combination of string- and corpus-based features. On a per-word basis, the F1-score is 53% and accuracy is 70%, which outperforms a frequency-based baseline by a wide margin. We discuss a number of possible directions for future research.
topic computational morphology
paradigm prediction
machine learning
feature selection
url http://www.trojina.org/slovenscina2.0/arhiv/2013/2/Slo2.0_2013_2_02.pdf
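
The description above reports two kinds of figures: accuracy over individual candidate lemma-paradigm pairs (92%) and results on a per-word basis (70% accuracy). The following is a minimal sketch of how these two views can differ, not the author's implementation. It assumes that each unknown word has several candidate pairs, that a classifier has already assigned each pair a confidence score, and that the per-word prediction is simply the highest-scoring candidate. The `Candidate` class, the paradigm labels, the scores, and the 0.5 threshold are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    lemma: str
    paradigm: str   # hypothetical paradigm label
    score: float    # made-up classifier confidence that the pair is correct
    correct: bool   # gold label for the lemma-paradigm pair


def pair_accuracy(candidates_by_word, threshold=0.5):
    """Fraction of candidate pairs whose thresholded decision matches the gold label."""
    total = 0
    hits = 0
    for candidates in candidates_by_word.values():
        for c in candidates:
            total += 1
            hits += (c.score >= threshold) == c.correct
    return hits / total


def word_accuracy(candidates_by_word):
    """Fraction of words whose highest-scoring candidate is the correct pair."""
    hits = 0
    for candidates in candidates_by_word.values():
        best = max(candidates, key=lambda c: c.score)
        hits += best.correct
    return hits / len(candidates_by_word)


# Toy data: two word forms with illustrative (not grammar-generated) candidates.
data = {
    "vojnika": [
        Candidate("vojnik", "N1", 0.90, True),
        Candidate("vojnika", "N2", 0.20, False),
        Candidate("vojniko", "N3", 0.10, False),
    ],
    "kuće": [
        Candidate("kuća", "N4", 0.40, True),
        Candidate("kuće", "N5", 0.60, False),
        Candidate("kuć", "N1", 0.10, False),
        Candidate("kućo", "N3", 0.05, False),
        Candidate("kući", "N6", 0.15, False),
    ],
}

print(pair_accuracy(data))  # 0.75
print(word_accuracy(data))  # 0.5
```

In this toy setting most candidates are easy negatives, so accuracy over pairs is higher than accuracy over words: one wrongly ranked candidate is enough to get the whole word wrong.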