A new model for Persian multi-part words edition based on statistical machine translation
Multi-part words in English are hyphenated: a hyphen separates the parts. Persian has multi-part words as well. Under Persian morphology, a half-space character is needed to separate the parts of a multi-part word, but in many cases people incorrectly use spac...
Main Authors: | M. Zahedi, A. Arjomandzadeh |
---|---|
Format: | Article |
Language: | English |
Published: | Shahrood University of Technology, 2016-01-01 |
Series: | Journal of Artificial Intelligence and Data Mining |
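Subjects: | Persian Multi-Part Words; Statistical Machine Translation; Fertility-based IBM Model; Syntax-Based Decoder; Spacing Rules |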
Online Access: | http://jad.shahroodut.ac.ir/article_495_4779dc10c5df9c7ba8cddf674d016abd.pdf |
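
The spacing problem the abstract describes can be made concrete with a minimal sketch. It assumes the "half-space" is encoded as the Unicode zero-width non-joiner (U+200C), which is the usual encoding in Persian text; the example word and the helper `token_count` are illustrative and are not taken from the article.

```python
import unicodedata

ZWNJ = "\u200c"  # assumption: the half-space is encoded as U+200C

# Illustrative word (not from the article): prefix + stem of "I go".
PREFIX = "می"
STEM = "روم"

with_half_space = PREFIX + ZWNJ + STEM  # one orthographic word
with_space = PREFIX + " " + STEM        # the common typing error


def token_count(text: str) -> int:
    """Count whitespace-delimited tokens, as a naive tokenizer would."""
    return len(text.split())


if __name__ == "__main__":
    print(unicodedata.name(ZWNJ))
    print(token_count(with_half_space), token_count(with_space))
```

Because U+200C is not whitespace, the half-space form stays a single token, while the form typed with an ordinary space splits into two. This is one way the misuse can disturb downstream text processing.
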
id |
doaj-3a3fceb3a2a44128950c7e49697db124 |
---|---|
record_format |
Article |
spelling |
doaj-3a3fceb3a2a44128950c7e49697db124; 2020-11-24T22:00:46Z; eng; Shahrood University of Technology; Journal of Artificial Intelligence and Data Mining; ISSN 2322-5211, 2322-4444; 2016-01-01; vol. 4, no. 1, pp. 27-34; doi:10.5829/idosi.JAIDM.2016.04.01.04; article 495; A new model for Persian multi-part words edition based on statistical machine translation; M. Zahedi, A. Arjomandzadeh (both: School of Computer Engineering & Information Technology, University of Shahrood, Shahrood, Iran); http://jad.shahroodut.ac.ir/article_495_4779dc10c5df9c7ba8cddf674d016abd.pdf; Persian Multi-Part Words; Statistical Machine Translation; Fertility-based IBM Model; Syntax-Based Decoder; Spacing Rules |
collection |
DOAJ |
language |
English |
format |
Article |
sources |
DOAJ |
author |
M. Zahedi, A. Arjomandzadeh |
title |
A new model for Persian multi-part words edition based on statistical machine translation |
publisher |
Shahrood University of Technology |
series |
Journal of Artificial Intelligence and Data Mining |
issn |
2322-5211, 2322-4444 |
publishDate |
2016-01-01 |
description |
Multi-part words in English are hyphenated: a hyphen separates the parts. Persian has multi-part words as well. Under Persian morphology, a half-space character is needed to separate the parts of a multi-part word, but in many cases people incorrectly use a space character instead. This common misuse of the space causes serious problems for Persian text processing and readability. To address these problems, this work proposes a new model for correcting spacing in multi-part words. The proposed method is based on the statistical machine translation paradigm, in which text in a source language is translated into text in a destination language using statistical models whose parameters are derived from the analysis of bilingual text corpora. The method applies statistical machine translation techniques, treating unedited multi-part words as the source language and space-edited multi-part words as the destination language. The results show that the proposed method can correct the spacing of Persian multi-part words with a statistically significant accuracy rate. |
topic |
Persian Multi-Part Words; Statistical Machine Translation; Fertility-based IBM Model; Syntax-Based Decoder; Spacing Rules |
url |
http://jad.shahroodut.ac.ir/article_495_4779dc10c5df9c7ba8cddf674d016abd.pdf |
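
The description above frames spacing correction as translation from unedited text (source) to space-edited text (target). The sketch below illustrates only that framing, under loud assumptions: it is not the authors' fertility-based IBM model or syntax-based decoder, but a much simpler relative-frequency table over adjacent token pairs with a greedy, monotone decision. The tiny parallel corpus, the function names (`boundary_labels`, `train`, `correct`), the 0.5 threshold, and the use of U+200C as the half-space are all hypothetical choices made for illustration.

```python
from collections import Counter

ZWNJ = "\u200c"  # assumption: half-space encoded as U+200C

# Illustrative Persian tokens (not from the article).
MAN, MI, RAVAM, KETAB, HA = "من", "می", "روم", "کتاب", "ها"

# Hypothetical parallel corpus: (unedited source, space-edited target).
PARALLEL = [
    (f"{MAN} {MI} {RAVAM}", f"{MAN} {MI}{ZWNJ}{RAVAM}"),
    (f"{KETAB} {HA}", f"{KETAB}{ZWNJ}{HA}"),
    (f"{MAN} {KETAB}", f"{MAN} {KETAB}"),
]


def boundary_labels(target):
    """True where the target joins two neighbouring parts with a half-space."""
    labels = []
    for token in target.split():
        parts = token.split(ZWNJ)
        labels += [True] * (len(parts) - 1) + [False]
    return labels[:-1]  # the last part has no right-hand neighbour


def train(pairs):
    """Estimate P(half-space | left, right) by relative frequency."""
    seen, joined = Counter(), Counter()
    for source, target in pairs:
        tokens = source.split()
        labels = boundary_labels(target)
        if len(labels) != max(len(tokens) - 1, 0):
            raise ValueError("source and target differ in more than spacing")
        for pair, is_joined in zip(zip(tokens, tokens[1:]), labels):
            seen[pair] += 1
            joined[pair] += int(is_joined)
    return {pair: joined[pair] / seen[pair] for pair in seen}


def correct(text, table, threshold=0.5):
    """Greedy left-to-right pass: use a half-space where the table favours it."""
    tokens = text.split()
    if not tokens:
        return ""
    out = [tokens[0]]
    for left, right in zip(tokens, tokens[1:]):
        sep = ZWNJ if table.get((left, right), 0.0) > threshold else " "
        out.append(sep + right)
    return "".join(out)


if __name__ == "__main__":
    table = train(PARALLEL)
    print(correct(f"{MAN} {KETAB} {HA}", table))
```

Token pairs that never occur in the training pairs keep their ordinary space, so this sketch cannot generalize beyond what it has seen. A full statistical translation system would combine a translation model with a language model to score candidates that the table alone cannot rank.
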