Authorship Identification of a Russian-Language Text Using Support Vector Machine and Deep Neural Networks

The article explores approaches to determining the author of a natural-language text and weighs their advantages and disadvantages. The problem matters because of the active digitalization of society and the shift of most everyday activities online. Text au...


Bibliographic Details
Main Authors:	Aleksandr Romanov, Anna Kurtukova, Alexander Shelupanov, Anastasia Fedotova, Valery Goncharov
Format:	Article
Language:	English
Published:	MDPI AG 2021-12-01
Series:	Future Internet
Subjects:	authorship; text mining; machine learning; attribution; neural networks; deep learning
Online Access:	https://www.mdpi.com/1999-5903/13/1/3

id	doaj-4cc60cd8ada24257b011cfbb4fd417af
record_format	Article
spelling	doaj-4cc60cd8ada24257b011cfbb4fd417af | 2020-12-26T00:02:19Z | eng | MDPI AG | Future Internet | 1999-5903 | 2021-12-01 | 13 | 3 | 3 | 10.3390/fi13010003 | Authorship Identification of a Russian-Language Text Using Support Vector Machine and Deep Neural Networks | Aleksandr Romanov [0]; Anna Kurtukova [1]; Alexander Shelupanov [2]; Anastasia Fedotova [3]; Valery Goncharov [4] | [0] Department of Security, Tomsk State University of Control Systems and Radioelectronics, 634050 Tomsk, Russia; [1] Department of Security, Tomsk State University of Control Systems and Radioelectronics, 634050 Tomsk, Russia; [2] Department of Security, Tomsk State University of Control Systems and Radioelectronics, 634050 Tomsk, Russia; [3] Department of Security, Tomsk State University of Control Systems and Radioelectronics, 634050 Tomsk, Russia; [4] Department of Automation and Robotics, the National Research Tomsk Polytechnic University, 634050 Tomsk, Russia | The article explores approaches to determining the author of a natural-language text and weighs their advantages and disadvantages. The problem matters because of the active digitalization of society and the shift of most everyday activities online. Text authorship methods are particularly useful for information security and forensics. For example, such methods can be used to identify the authors of suicide notes and other texts subjected to forensic examination. Another area of application is plagiarism detection, which is relevant both to intellectual property protection in the digital space and to the educational process. The article describes identifying the author of a Russian-language text using a support vector machine (SVM) and deep neural network architectures (long short-term memory (LSTM), convolutional neural networks (CNN) with attention, and Transformer). The results show that all the considered algorithms are suitable for solving the authorship identification problem, but SVM shows the best accuracy. The average accuracy of SVM reaches 96%. This is due to carefully chosen parameters and a feature space that includes statistical and semantic features (including those extracted through aspect analysis). Deep neural networks are inferior to SVM in accuracy and reach only 93%. The study also evaluates how attacks on the method affect the models’ accuracy. Experiments show that the SVM-based methods are not robust to deliberate text anonymization. In comparison, the loss in accuracy of deep neural networks does not exceed 20%. The Transformer architecture is the most effective for anonymized texts, achieving 81% accuracy. | https://www.mdpi.com/1999-5903/13/1/3 | authorship; text mining; machine learning; attribution; neural networks; deep learning
collection	DOAJ
language	English
format	Article
sources	DOAJ
author	Aleksandr Romanov, Anna Kurtukova, Alexander Shelupanov, Anastasia Fedotova, Valery Goncharov
author_facet	Aleksandr Romanov, Anna Kurtukova, Alexander Shelupanov, Anastasia Fedotova, Valery Goncharov
author_sort	Aleksandr Romanov
title	Authorship Identification of a Russian-Language Text Using Support Vector Machine and Deep Neural Networks
publisher	MDPI AG
series	Future Internet
issn	1999-5903
publishDate	2021-12-01
description	The article explores approaches to determining the author of a natural-language text and weighs their advantages and disadvantages. The problem matters because of the active digitalization of society and the shift of most everyday activities online. Text authorship methods are particularly useful for information security and forensics. For example, such methods can be used to identify the authors of suicide notes and other texts subjected to forensic examination. Another area of application is plagiarism detection, which is relevant both to intellectual property protection in the digital space and to the educational process. The article describes identifying the author of a Russian-language text using a support vector machine (SVM) and deep neural network architectures (long short-term memory (LSTM), convolutional neural networks (CNN) with attention, and Transformer). The results show that all the considered algorithms are suitable for solving the authorship identification problem, but SVM shows the best accuracy. The average accuracy of SVM reaches 96%. This is due to carefully chosen parameters and a feature space that includes statistical and semantic features (including those extracted through aspect analysis). Deep neural networks are inferior to SVM in accuracy and reach only 93%. The study also evaluates how attacks on the method affect the models’ accuracy. Experiments show that the SVM-based methods are not robust to deliberate text anonymization. In comparison, the loss in accuracy of deep neural networks does not exceed 20%. The Transformer architecture is the most effective for anonymized texts, achieving 81% accuracy.
topic	authorship; text mining; machine learning; attribution; neural networks; deep learning
url	https://www.mdpi.com/1999-5903/13/1/3


Similar Items