Development of the combined method of identification of near duplicates in electronic scientific works

Methods are described for identifying near-duplicates in electronic scientific papers that contain content of a single type, for example text data, mathematical formulas, or numerical data. For text data, the method of locality-sensitive hashing is formalized, with the Hamming distan...
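
As a rough illustration of the text-data step named in the summary above, the sketch below builds a bit fingerprint per text and measures the Hamming distance between two fingerprints. SimHash is used here only as one well-known locality-sensitive hashing scheme; the record does not say which scheme the article uses, so the tokenization, the MD5-based token hash, the 64-bit width, and the function names `simhash` and `hamming_distance` are all assumptions made for illustration.

```python
import hashlib

BITS = 64  # fingerprint width; an assumed value, not taken from the article


def simhash(text: str) -> int:
    """Return a SimHash-style 64-bit fingerprint of whitespace-separated tokens."""
    weights = [0] * BITS
    for token in text.lower().split():
        # Hash each token to 64 bits (first 8 bytes of its MD5 digest).
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        # Each bit of the token hash votes +1 or -1 on that fingerprint bit.
        for i in range(BITS):
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    d = hamming_distance(
        simhash("near duplicates in electronic scientific papers"),
        simhash("near duplicates in electronic scientific works"),
    )
    print(d)  # compare this value against a fixed numerical threshold
```

The resulting distance would then be compared with a fixed threshold, as the summary describes; the threshold value itself is not given in this record.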

Bibliographic Details
Main Authors:	Petro Lizunov, Andrii Biloshchytskyi, Alexander Kuchansky, Yurii Andrashko, Svitlana Biloshchytska, Oleg Serbin
Format:	Article
Language:	English
Published:	PC Technology Center 2021-08-01
Series:	Eastern-European Journal of Enterprise Technologies
Subjects:	near-duplicate; electronic scientific paper; antiplagiarism system; locally sensitive hashing
Online Access:	http://journals.uran.ua/eejet/article/view/238318

id	doaj-4d490559c38d4672af8b32247f11345f
record_format	Article
spelling	doaj-4d490559c38d4672af8b32247f11345f | 2021-09-03T14:06:28Z | eng | PC Technology Center | Eastern-European Journal of Enterprise Technologies | 1729-3774 | 1729-4061 | 2021-08-01 | 4 | 4(112) | 57 | 63 | 10.15587/1729-4061.2021.238318 | 276021 | Petro Lizunov [0] https://orcid.org/0000-0003-2924-3025 | Andrii Biloshchytskyi [1] https://orcid.org/0000-0001-9548-1959 | Alexander Kuchansky [2] https://orcid.org/0000-0003-1277-8031 | Yurii Andrashko [3] https://orcid.org/0000-0003-2306-8377 | Svitlana Biloshchytska [4] https://orcid.org/0000-0002-0856-5474 | Oleg Serbin [5] https://orcid.org/0000-0003-3119-690X | Kyiv National University of Construction and Architecture | Astana IT University; Taras Shevchenko National University of Kyiv | Taras Shevchenko National University of Kyiv | Uzhhorod National University | Taras Shevchenko National University of Kyiv | Taras Shevchenko National University of Kyiv
collection	DOAJ
language	English
format	Article
sources	DOAJ
author	Petro Lizunov, Andrii Biloshchytskyi, Alexander Kuchansky, Yurii Andrashko, Svitlana Biloshchytska, Oleg Serbin
title	Development of the combined method of identification of near duplicates in electronic scientific works
publisher	PC Technology Center
series	Eastern-European Journal of Enterprise Technologies
issn	1729-3774 1729-4061
publishDate	2021-08-01
description	Methods are described for identifying near-duplicates in electronic scientific papers that contain content of a single type, for example text data, mathematical formulas, or numerical data. For text data, the method of locality-sensitive hashing is formalized, with the Hamming distance computed between the elements of the indices of electronic scientific papers. If the Hamming distance exceeds a fixed numerical threshold, a scientific paper contains a near-duplicate. For numerical data, sub-sequences are formed for each scientific work, and the proximity between papers is determined as the Euclidean distance between the vectors made up of the numbers in these sub-sequences. To compare mathematical formulas, a method for comparing the sample of formulas is used, and the names of variables are compared. To identify near-duplicates in graphic information, two approaches are distinguished: finding key points in the image and applying locality-sensitive hashing to individual pixels of the image. Since scientific papers often include objects such as schemes and diagrams, their captions are examined separately using the methods for comparing text information. A combined method for identifying near-duplicates in electronic scientific papers is proposed, which brings together the methods for identifying near-duplicates in the various types of data. To implement the combined method, an information-analytical system was devised that processes scientific materials depending on the content type. This makes it possible to identify near-duplicates with high quality and to detect as wide a range as possible of abuses and plagiarism in electronic scientific papers: scientific articles, dissertations, monographs, conference materials, etc.
topic	near-duplicate; electronic scientific paper; antiplagiarism system; locally sensitive hashing
url	http://journals.uran.ua/eejet/article/view/238318

Similar Items