Development of the combined method of identification of near duplicates in electronic scientific works

The methods for identification of near-duplicates in electronic scientific papers, which include the content of the same type, for example, text data, mathematical formulas, numerical data, etc. were described. For text data, the method of locally sensitive hashing with the finding of Hamming distan...

Bibliographic Details
Main Authors: Petro Lizunov, Andrii Biloshchytskyi, Alexander Kuchansky, Yurii Andrashko, Svitlana Biloshchytska, Oleg Serbin
Format: Article
Language: English
Published: PC Technology Center 2021-08-01
Series: Eastern-European Journal of Enterprise Technologies
Subjects: near-duplicate; electronic scientific paper; antiplagiarism system; locally sensitive hashing
Online Access: http://journals.uran.ua/eejet/article/view/238318
id doaj-4d490559c38d4672af8b32247f11345f
record_format Article
spelling doaj-4d490559c38d4672af8b32247f11345f
2021-09-03T14:06:28Z
eng
PC Technology Center
Eastern-European Journal of Enterprise Technologies
1729-3774
1729-4061
2021-08-01
4
4(112)
57
63
10.15587/1729-4061.2021.238318
276021
Development of the combined method of identification of near duplicates in electronic scientific works
Petro Lizunov, https://orcid.org/0000-0003-2924-3025, Kyiv National University of Construction and Architecture
Andrii Biloshchytskyi, https://orcid.org/0000-0001-9548-1959, Astana IT University; Taras Shevchenko National University of Kyiv
Alexander Kuchansky, https://orcid.org/0000-0003-1277-8031, Taras Shevchenko National University of Kyiv
Yurii Andrashko, https://orcid.org/0000-0003-2306-8377, Uzhhorod National University
Svitlana Biloshchytska, https://orcid.org/0000-0002-0856-5474, Taras Shevchenko National University of Kyiv
Oleg Serbin, https://orcid.org/0000-0003-3119-690X, Taras Shevchenko National University of Kyiv
collection DOAJ
language English
format Article
sources DOAJ
author Petro Lizunov
Andrii Biloshchytskyi
Alexander Kuchansky
Yurii Andrashko
Svitlana Biloshchytska
Oleg Serbin
title Development of the combined method of identification of near duplicates in electronic scientific works
publisher PC Technology Center
series Eastern-European Journal of Enterprise Technologies
issn 1729-3774
1729-4061
publishDate 2021-08-01
description The methods for identification of near-duplicates in electronic scientific papers, which include the content of the same type, for example, text data, mathematical formulas, numerical data, etc. were described. For text data, the method of locally sensitive hashing with the finding of Hamming distance between the elements of indices of electronic scientific papers was formalized. If the Hamming distance exceeds a fixed numerical threshold, a scientific paper contains a near-duplicate. For numerical data, sub-sequences for each scientific work are formed and the proximity between the papers is determined as the Euclidean distance between the vectors consisting of the numbers of these sub-sequences. To compare mathematical formulas, the method for comparing the sample of formulas is used and the names of variables are compared. To identify near-duplicates in graphic information, two directions are separated: finding key points in the image and applying locally sensitive hashing for individual pixels of the image. Since scientific papers often include such objects as schemes and diagrams, their captions are examined separately using the methods for comparing text information. The combined method for identification of near-duplicates in electronic scientific papers, which combines the methods for identification of near-duplicates of various types of data, was proposed. To implement the combined method for the identification of near-duplicates in electronic scientific papers, an information-analytical system that processes scientific materials depending on the content type was devised. This makes it possible to qualitatively identify near-duplicates and as widely as possible identify possible abuses and plagiarism in electronic scientific papers: scientific articles, dissertations, monographs, conference materials, etc.
topic near-duplicate
electronic scientific paper
antiplagiarism system
locally sensitive hashing
url http://journals.uran.ua/eejet/article/view/238318
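
The text-comparison step named in the description (locally sensitive hashing followed by a Hamming distance between fingerprints) can be illustrated with a minimal SimHash-style sketch. Everything here is an assumption made for illustration and is not taken from the article: the function names `shingles`, `simhash`, `hamming_distance` and `is_near_duplicate`, the 3-word shingles, the MD5-based 64-bit fingerprint, and the threshold of 3 bits. The sketch treats a small distance as the sign of a near-duplicate, which is the usual SimHash convention; the article's own decision rule and threshold may be defined differently.

```python
import hashlib
import re


def shingles(text, k=3):
    """Split text into overlapping k-word shingles (hypothetical tokenisation)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return [" ".join(words)] if words else []
    return [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]


def simhash(text, bits=64, k=3):
    """Return a SimHash-style fingerprint; bits must be a multiple of 8, at most 128."""
    counts = [0] * bits
    for sh in shingles(text, k):
        digest = hashlib.md5(sh.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        # Each shingle votes +1 or -1 on every bit position.
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, c in enumerate(counts):
        if c > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(text_a, text_b, threshold=3):
    """Assumed rule: flag the pair when the fingerprints differ in few bits."""
    return hamming_distance(simhash(text_a), simhash(text_b)) <= threshold


if __name__ == "__main__":
    a = "The methods for identification of near-duplicates in scientific papers"
    b = "The methods for identification of near-duplicates in scientific articles"
    print(hamming_distance(simhash(a), simhash(b)))
```

Identical texts always yield a distance of 0; how far apart edited or unrelated texts land depends on the shingle size and fingerprint width, so any threshold would need tuning on real data.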