A new model for Persian multi-part words edition based on statistical machine translation

In English, multi-part words are hyphenated: the hyphen separates their parts. Persian also has multi-part words. According to Persian morphology, a half-space character is needed to separate the parts of a multi-part word, but in many cases people incorrectly use spac...

Bibliographic Details
Main Authors: M. Zahedi, A. Arjomandzadeh
Format: Article
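Language: English
Published: Shahrood University of Technology 2016-01-01
Series: Journal of Artificial Intelligence and Data Mining
Subjects: Persian Multi-Part Words; Statistical Machine Translation; Fertility-based IBM Model; Syntax-Based Decoder; Spacing Rules
Online Access: http://jad.shahroodut.ac.ir/article_495_4779dc10c5df9c7ba8cddf674d016abd.pdf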
id doaj-3a3fceb3a2a44128950c7e49697db124
record_format Article
spelling doaj-3a3fceb3a2a44128950c7e49697db124; 2020-11-24T22:00:46Z; eng; Shahrood University of Technology; Journal of Artificial Intelligence and Data Mining; ISSN 2322-5211, 2322-4444; 2016-01-01; vol. 4, no. 1, pp. 27-34; doi 10.5829/idosi.JAIDM.2016.04.01.04; article 495; M. Zahedi, A. Arjomandzadeh; School of Computer Engineering & Information Technology, University of Shahrood, Shahrood, Iran.
collection DOAJ
language English
format Article
sources DOAJ
author M. Zahedi
A. Arjomandzadeh
title A new model for Persian multi-part words edition based on statistical machine translation
publisher Shahrood University of Technology
series Journal of Artificial Intelligence and Data Mining
issn 2322-5211
2322-4444
publishDate 2016-01-01
description In English, multi-part words are hyphenated: the hyphen separates their parts. Persian also has multi-part words. According to Persian morphology, a half-space character is needed to separate the parts of a multi-part word, but in many cases people incorrectly use a space character instead. This common misuse of the space character causes serious problems for Persian text processing and readability. To address these problems, this work proposes a new model for correcting spacing in multi-part words. The proposed method is based on the statistical machine translation paradigm, in which text in a source language is translated into text in a destination language using statistical models whose parameters are derived from the analysis of bilingual text corpora. The method applies statistical machine translation techniques, treating unedited multi-part words as the source language and space-edited multi-part words as the destination language. The results show that the proposed method improves spacing correction of Persian multi-part words with a statistically significant accuracy rate.
topic Persian Multi-Part Words
Statistical Machine Translation
Fertility-based IBM Model
Syntax-Based Decoder
Spacing Rules
url http://jad.shahroodut.ac.ir/article_495_4779dc10c5df9c7ba8cddf674d016abd.pdf
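
The description above frames spacing correction as translation from unedited text (source) to space-edited text (destination). The following Python sketch illustrates that framing only; it is not the authors' fertility-based IBM model or syntax-based decoder. It assumes the half-space is encoded as the zero-width non-joiner (U+200C), builds a parallel pair by degrading correctly spaced text, and replaces the translation model with simple counts over adjacent token pairs. The function names `make_pair`, `train`, and `correct`, and the `threshold` parameter, are hypothetical.

```python
from collections import Counter

ZWNJ = "\u200c"  # zero-width non-joiner, assumed here to be the Persian half-space


def make_pair(edited):
    """Build a (source, target) pair: the source is the edited text
    with every half-space degraded to an ordinary space."""
    return edited.replace(ZWNJ, " "), edited


def train(edited_sentences):
    """Count, for each adjacent pair of word parts, how often the pair is
    joined by a half-space and how often it occurs at all."""
    joined, total = Counter(), Counter()
    for sentence in edited_sentences:
        # Each word becomes the list of its half-space-separated parts.
        words = [[p for p in w.split(ZWNJ) if p] for w in sentence.split()]
        words = [w for w in words if w]
        for parts in words:
            for pair in zip(parts, parts[1:]):
                joined[pair] += 1  # parts separated by a half-space
                total[pair] += 1
        for left, right in zip(words, words[1:]):
            total[(left[-1], right[0])] += 1  # parts separated by a space
    return joined, total


def correct(source, joined, total, threshold=0.5):
    """Rewrite a space as a half-space wherever the training counts say the
    two neighbouring tokens were joined more often than `threshold`."""
    tokens = source.split()
    if not tokens:
        return ""
    out = [tokens[0]]
    for prev, cur in zip(tokens, tokens[1:]):
        seen = total[(prev, cur)]
        p_join = joined[(prev, cur)] / seen if seen else 0.0
        out.append((ZWNJ if p_join > threshold else " ") + cur)
    return "".join(out)


if __name__ == "__main__":
    # Toy edited corpus; each string contains half-spaces inside multi-part words.
    corpus = [
        "من " + "می" + ZWNJ + "روم",
        "کتاب" + ZWNJ + "ها را " + "می" + ZWNJ + "خوانم",
    ]
    joined, total = train(corpus)
    source, target = make_pair(corpus[0])
    print(correct(source, joined, total) == target)
```

A real system along the lines of the description would replace the counting step with alignment and translation models estimated from a large parallel corpus; the counts here only show how the source and destination sides relate.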