A new model for persian multi-part words edition based on statistical machine translation

Multi-part words in English are hyphenated, with a hyphen separating the parts. Persian has multi-part words as well. According to Persian morphology, a half-space character is needed to separate the parts of a multi-part word, but in many cases people incorrectly use spac...
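
The space versus half-space distinction described in the abstract can be made concrete with a small sketch. It assumes the half-space is encoded as U+200C (ZERO WIDTH NON-JOINER), the usual encoding in Persian text, and it builds one "unedited" and "edited" pair of the kind a translation-style spacing corrector could be trained on. The function name `make_parallel_pair` and the idea of deriving the unedited side by degrading correct text are illustrative assumptions, not the authors' documented procedure.

```python
# Assumption: the Persian half-space is U+200C (ZERO WIDTH NON-JOINER).
ZWNJ = "\u200c"


def make_parallel_pair(edited: str) -> tuple[str, str]:
    """Return a hypothetical (source, target) training pair.

    The target is the correctly spaced text. The source imitates the common
    typing error by replacing every half-space with an ordinary space.
    """
    source = edited.replace(ZWNJ, " ")
    return source, edited


# "mi-ravam" ("I go"): the prefix and the stem are joined by a half-space.
word = "\u0645\u06cc" + ZWNJ + "\u0631\u0648\u0645"
source, target = make_parallel_pair(word)

# Whitespace tokenization sees two tokens in the mis-spaced form but one
# token in the correct form, because U+200C is not a whitespace character.
print(len(source.split()), len(target.split()))
```

The differing token counts show why the mis-spaced form is harmful for downstream text processing: a single multi-part word is split into two unrelated tokens.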


Bibliographic Details
Main Authors:	M. Zahedi, A. Arjomandzadeh
Format:	Article
Language:	English
Published:	Shahrood University of Technology 2016-01-01
Series:	Journal of Artificial Intelligence and Data Mining
Subjects:	Persian Multi-Part Words; Statistical Machine Translation; Fertility-based IBM Model; Syntax-Based Decoder; Spacing Rules
Online Access:	http://jad.shahroodut.ac.ir/article_495_4779dc10c5df9c7ba8cddf674d016abd.pdf

id	doaj-3a3fceb3a2a44128950c7e49697db124
record_format	Article
spelling	doaj-3a3fceb3a2a44128950c7e49697db124; 2020-11-24T22:00:46Z; eng; Shahrood University of Technology; Journal of Artificial Intelligence and Data Mining; 2322-5211; 2322-4444; 2016-01-01; 4; 1; 27; 34; 10.5829/idosi.JAIDM.2016.04.01.04; 495; A new model for persian multi-part words edition based on statistical machine translation; M. Zahedi; A. Arjomandzadeh; School of Computer Engineering & Information Technology, University of Shahrood, Shahrood, Iran; http://jad.shahroodut.ac.ir/article_495_4779dc10c5df9c7ba8cddf674d016abd.pdf; Persian Multi-Part Words; Statistical Machine Translation; Fertility-based IBM Model; Syntax-Based Decoder; Spacing Rules
collection	DOAJ
language	English
format	Article
sources	DOAJ
author	M. Zahedi; A. Arjomandzadeh
spellingShingle	M. Zahedi A. Arjomandzadeh A new model for persian multi-part words edition based on statistical machine translation Journal of Artificial Intelligence and Data Mining Persian Multi-Part Words Statistical Machine Translation Fertility-based IBM Model Syntax-Based Decoder Spacing Rules
author_facet	M. Zahedi; A. Arjomandzadeh
author_sort	M. Zahedi
title	A new model for persian multi-part words edition based on statistical machine translation
title_short	A new model for persian multi-part words edition based on statistical machine translation
title_full	A new model for persian multi-part words edition based on statistical machine translation
title_fullStr	A new model for persian multi-part words edition based on statistical machine translation
title_full_unstemmed	A new model for persian multi-part words edition based on statistical machine translation
title_sort	new model for persian multi-part words edition based on statistical machine translation
publisher	Shahrood University of Technology
series	Journal of Artificial Intelligence and Data Mining
issn	2322-5211 2322-4444
publishDate	2016-01-01
description	Multi-part words in English are hyphenated, with a hyphen separating the parts. Persian has multi-part words as well. According to Persian morphology, a half-space character is needed to separate the parts of a multi-part word, but in many cases people incorrectly use a space character instead. This common misuse of the space causes serious problems for Persian text processing and text readability. To address these problems, this work proposes a new model for correcting spacing in multi-part words. The proposed method is based on the statistical machine translation paradigm, in which text in a source language is translated into text in a destination language using statistical models whose parameters are derived from the analysis of bilingual text corpora. The method applies statistical machine translation techniques, treating unedited multi-part words as the source language and space-edited multi-part words as the destination language. The results show that the proposed method corrects the spacing of Persian multi-part words with a statistically significant accuracy rate.
topic	Persian Multi-Part Words; Statistical Machine Translation; Fertility-based IBM Model; Syntax-Based Decoder; Spacing Rules
url	http://jad.shahroodut.ac.ir/article_495_4779dc10c5df9c7ba8cddf674d016abd.pdf
work_keys_str_mv	AT mzahedi anewmodelforpersianmultipartwordseditionbasedonstatisticalmachinetranslation AT aarjomandzadeh anewmodelforpersianmultipartwordseditionbasedonstatisticalmachinetranslation AT mzahedi newmodelforpersianmultipartwordseditionbasedonstatisticalmachinetranslation AT aarjomandzadeh newmodelforpersianmultipartwordseditionbasedonstatisticalmachinetranslation
_version_	1725842805335523328


Similar Items