Devising an entropy-based approach for identifying patterns in multilingual texts

Even though the plagiarism identification issue remains relevant, modern detection methods are still resource-intensive. This paper reports a more efficient alternative to existing solutions. The devised system for identifying patterns in multilingual texts compares two texts and determines, by us...

Bibliographic Details
Main Authors: Gulnur Yerkebulan, Valentina Kulikova, Vladimir Kulikov, Zaru Kulsharipova
Format: Article
Language: English
Published: PC Technology Center, 2021-04-01
Series: Eastern-European Journal of Enterprise Technologies
Subjects: google translator; yandex.translator; renyi entropy; minkowski metric; hamming distance
Online Access: http://journals.uran.ua/eejet/article/view/228695
id doaj-29f07e72af924cdc80568bf1b95fd094
record_format Article
spelling doaj-29f07e72af924cdc80568bf1b95fd094; 2021-05-11T13:10:04Z; eng; PC Technology Center; Eastern-European Journal of Enterprise Technologies; ISSN 1729-3774, 1729-4061; 2021-04-01; vol. 2, no. 2 (110), pp. 16-22; DOI 10.15587/1729-4061.2021.228695; 266246
Gulnur Yerkebulan, https://orcid.org/0000-0002-8317-1758, Manash Kozybayev North Kazakhstan University
Valentina Kulikova, https://orcid.org/0000-0001-8198-2672, Manash Kozybayev North Kazakhstan University
Vladimir Kulikov, https://orcid.org/0000-0002-6352-0949, Manash Kozybayev North Kazakhstan University
Zaru Kulsharipova, https://orcid.org/0000-0001-6170-099X, Pavlodar Pedagogical University
collection DOAJ
language English
format Article
sources DOAJ
author Gulnur Yerkebulan
Valentina Kulikova
Vladimir Kulikov
Zaru Kulsharipova
title Devising an entropy-based approach for identifying patterns in multilingual texts
publisher PC Technology Center
series Eastern-European Journal of Enterprise Technologies
issn 1729-3774
1729-4061
publishDate 2021-04-01
description Although plagiarism detection remains a relevant problem, modern detection methods are still resource-intensive. This paper reports a more efficient alternative to existing solutions. The devised system for identifying patterns in multilingual texts compares two texts and determines, using several approaches, whether the second text is a translation of the first. The approach is based on Renyi entropy. An original text from an English writer's work and five texts in Russian were selected for the study. The real and "fake" translations comprised translations by Google Translator and Yandex Translator, an author's book translation, a text from another work by an English writer, and a fake text. The fake text was compiled to have the same keyword frequencies as the authentic text. After a key series of high-frequency words was formed for the original text, the corresponding key series were identified for the other texts. The entropies of the texts were then calculated with the texts divided into "sentences" and into "paragraphs". A Minkowski metric was used to calculate the proximity of the texts. It underlies the calculation of the Hamming distance, the Cartesian distance, the distance between the centers of mass, the distance between the geometric centers, and the distance between the centers of parametric means. The proximity of texts was found to be best determined by the relative distance between the centers of parametric means (greater than 3 for "fake" texts, less than 1 for translations). Calculating the proximity of texts with the Renyi-entropy-based algorithm reported in this work saves resources and time compared to methods based on neural networks. All the raw data and an example of the entropy calculation in PHP are publicly available.
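The two quantities named in the description, Renyi entropy and the Minkowski metric, can be illustrated with a minimal sketch. This is not the authors' PHP example and does not reproduce their procedure: the function names (`renyi_entropy`, `keyword_probs`, `entropy_profile`, `minkowski`), the choice of order alpha = 2, whitespace tokenization, and the toy sentences are all assumptions made for illustration. The sketch only shows the standard definitions applied to keyword frequencies per text segment.

```python
import math
from collections import Counter


def renyi_entropy(probs, alpha=2.0):
    """Renyi entropy H_a = log(sum(p_i ** a)) / (1 - a), natural log.

    alpha close to 1 is handled as the Shannon limit.
    """
    if alpha < 0:
        raise ValueError("alpha must be non-negative")
    probs = [p for p in probs if p > 0]
    if not probs:
        # Assumption: a segment containing no keywords gets zero entropy.
        return 0.0
    if math.isclose(alpha, 1.0):
        return -sum(p * math.log(p) for p in probs)
    return math.log(sum(p ** alpha for p in probs)) / (1.0 - alpha)


def keyword_probs(segment, keywords):
    """Relative frequencies of the chosen keywords inside one segment."""
    counts = Counter(w for w in segment.lower().split() if w in keywords)
    total = sum(counts.values())
    return [c / total for c in counts.values()] if total else []


def entropy_profile(segments, keywords, alpha=2.0):
    """One entropy value per segment ("sentence" or "paragraph")."""
    return [renyi_entropy(keyword_probs(s, keywords), alpha) for s in segments]


def minkowski(x, y, p=2.0):
    """Minkowski distance (sum |x_i - y_i| ** p) ** (1 / p), for p >= 1."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    if p < 1:
        raise ValueError("p must be at least 1")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


if __name__ == "__main__":
    # Hypothetical toy data; real use would pair keywords across languages.
    keywords = {"cat", "dog"}
    original = ["the cat saw the dog", "the dog ran"]
    candidate = ["a cat and a cat", "the dog and the cat"]
    a = entropy_profile(original, keywords)
    b = entropy_profile(candidate, keywords)
    print(minkowski(a, b, p=1), minkowski(a, b, p=2))
```

With p = 1 on vectors of zeros and ones the Minkowski distance equals the Hamming distance, and with p = 2 it is the ordinary Euclidean distance, which is the sense in which one metric can underlie several of the distances listed in the description. The sketch compares profiles of equal length only; how the paper aligns texts with different numbers of segments is not stated in this record.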
topic google translator
yandex.translator
renyi entropy
minkowski metric
hamming distance
url http://journals.uran.ua/eejet/article/view/228695