Development of the combined method of identification of near duplicates in electronic scientific works
The methods for identification of near-duplicates in electronic scientific papers, which include the content of the same type, for example, text data, mathematical formulas, numerical data, etc. were described. For text data, the method of locally sensitive hashing with the finding of Hamming distan...
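Main Authors: | Petro Lizunov, Andrii Biloshchytskyi, Alexander Kuchansky, Yurii Andrashko, Svitlana Biloshchytska, Oleg Serbin |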
---|---|
Format: | Article |
Language: | English |
Published: | PC Technology Center, 2021-08-01 |
Series: | Eastern-European Journal of Enterprise Technologies |
Subjects: | near-duplicate; electronic scientific paper; antiplagiarism system; locally sensitive hashing |
Online Access: | http://journals.uran.ua/eejet/article/view/238318 |
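
The text-data step in the abstract (locally sensitive hashing, more commonly called locality-sensitive hashing, followed by a Hamming distance between index elements) can be illustrated with a minimal sketch. The record does not say which hash construction the authors use, so the SimHash-style fingerprint below is a swapped-in, well-known technique, and the names `simhash`, `hamming_distance`, `bits`, and `shingle_size` are hypothetical. The threshold value and the direction of the comparison against it are deliberately left to the caller.

```python
import hashlib
import re


def simhash(text: str, bits: int = 64, shingle_size: int = 3) -> int:
    """Return a SimHash-style fingerprint of `text` built from word shingles.

    Assumption: SimHash stands in for the hashing scheme of the paper,
    which this record does not describe.
    """
    words = re.findall(r"\w+", text.lower())
    # Overlapping word n-grams; a short text yields one shingle with all its words.
    shingles = [
        " ".join(words[i:i + shingle_size])
        for i in range(max(1, len(words) - shingle_size + 1))
    ]
    weights = [0] * bits
    for shingle in shingles:
        digest = hashlib.sha1(shingle.encode("utf-8")).digest()
        value = int.from_bytes(digest[:bits // 8], "big")
        # Each shingle votes +1 or -1 on every bit position.
        for bit in range(bits):
            weights[bit] += 1 if (value >> bit) & 1 else -1
    # A fingerprint bit is set when the votes for it are positive on balance.
    return sum(1 << bit for bit in range(bits) if weights[bit] > 0)


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

A caller would compute `hamming_distance(simhash(doc_a), simhash(doc_b))` for each pair of indexed fragments and compare the result with a chosen threshold.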
id |
doaj-4d490559c38d4672af8b32247f11345f |
---|---|
record_format |
Article |
spelling |
doaj-4d490559c38d4672af8b32247f11345f; 2021-09-03T14:06:28Z; eng; PC Technology Center; Eastern-European Journal of Enterprise Technologies; ISSN 1729-3774, 1729-4061; 2021-08-01; vol. 4, no. 4(112), pp. 57-63; DOI 10.15587/1729-4061.2021.238318; 276021; Development of the combined method of identification of near duplicates in electronic scientific works
Authors: Petro Lizunov (https://orcid.org/0000-0003-2924-3025); Andrii Biloshchytskyi (https://orcid.org/0000-0001-9548-1959); Alexander Kuchansky (https://orcid.org/0000-0003-1277-8031); Yurii Andrashko (https://orcid.org/0000-0003-2306-8377); Svitlana Biloshchytska (https://orcid.org/0000-0002-0856-5474); Oleg Serbin (https://orcid.org/0000-0003-3119-690X)
Affiliations, in record order: Kyiv National University of Construction and Architecture; Astana IT University and Taras Shevchenko National University of Kyiv; Taras Shevchenko National University of Kyiv; Uzhhorod National University; Taras Shevchenko National University of Kyiv; Taras Shevchenko National University of Kyiv
http://journals.uran.ua/eejet/article/view/238318; near-duplicate; electronic scientific paper; antiplagiarism system; locally sensitive hashing |
collection |
DOAJ |
language |
English |
format |
Article |
sources |
DOAJ |
author |
Petro Lizunov Andrii Biloshchytskyi Alexander Kuchansky Yurii Andrashko Svitlana Biloshchytska Oleg Serbin |
title |
Development of the combined method of identification of near duplicates in electronic scientific works |
publisher |
PC Technology Center |
series |
Eastern-European Journal of Enterprise Technologies |
issn |
1729-3774 1729-4061 |
publishDate |
2021-08-01 |
description |
The paper describes methods for identifying near-duplicates in electronic scientific papers that contain content of a single type, for example text data, mathematical formulas, or numerical data. For text data, the method of locality-sensitive hashing, with the Hamming distance computed between the elements of the indices of electronic scientific papers, is formalized. If the Hamming distance exceeds a fixed numerical threshold, a scientific paper contains a near-duplicate. For numerical data, sub-sequences are formed for each scientific work, and the proximity between papers is determined as the Euclidean distance between the vectors made up of the numbers in these sub-sequences. To compare mathematical formulas, a method that compares samples of formulas is used, and the names of variables are compared. To identify near-duplicates in graphic information, two approaches are distinguished: finding key points in the image and applying locality-sensitive hashing to individual pixels of the image. Since scientific papers often include objects such as schemes and diagrams, their captions are examined separately using the methods for comparing text information. A combined method for identifying near-duplicates in electronic scientific papers, which brings together the methods for the various types of data, is proposed. To implement the combined method, an information-analytical system that processes scientific materials according to their content type was devised. This makes it possible to identify near-duplicates with high quality and to detect possible abuses and plagiarism as broadly as possible in electronic scientific papers: scientific articles, dissertations, monographs, conference materials, etc. |
topic |
near-duplicate electronic scientific paper antiplagiarism system locally sensitive hashing |
url |
http://journals.uran.ua/eejet/article/view/238318 |
_version_ |
1717816372069662720 |
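
The numerical-data step in the description (sub-sequences of numbers compared by Euclidean distance) can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: the record does not specify how sub-sequences are formed, so fixed-length sliding windows over the unsigned numbers found in the text are assumed, and the names `number_subsequences`, `closest_subsequence_distance`, and `length` are hypothetical.

```python
import math
import re


def number_subsequences(text: str, length: int = 3) -> list[tuple[float, ...]]:
    """Return sliding windows of `length` consecutive numbers found in `text`.

    Assumption: fixed-length windows over unsigned numbers; signs are ignored.
    """
    numbers = [float(token) for token in re.findall(r"\d+(?:\.\d+)?", text)]
    return [
        tuple(numbers[i:i + length])
        for i in range(len(numbers) - length + 1)
    ]


def closest_subsequence_distance(text_a: str, text_b: str, length: int = 3) -> float:
    """Smallest Euclidean distance between any pair of number windows.

    Returns infinity when either text has fewer than `length` numbers.
    """
    windows_a = number_subsequences(text_a, length)
    windows_b = number_subsequences(text_b, length)
    if not windows_a or not windows_b:
        return math.inf
    return min(math.dist(u, v) for u in windows_a for v in windows_b)
```

A distance of zero means that one run of `length` numbers occurs in both texts; how small a distance should count as a near-duplicate is a threshold the caller has to choose.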