The Method for Reducing the Term Vector Size for Category Classification of Text Documents

The article proposes a method for reducing the time needed to subsume a document (assign it to a category) during text classification by reducing the term vector size of individual categories. According to the method, the term weight factors were calculated for each classification category to implement su...

Bibliographic Details
Main Authors: Golub T.V., Tiahunova M. Yu.
Format: Article
Language: English
Published: Academy of Sciences of Moldova 2019-06-01
Series: Problems of the Regional Energetics
Subjects: text classification, stemming
Online Access: http://journal.ie.asm.md/assets/files/09_12_41_2019.pdf
id doaj-ca95a91462a7421da48cbddfbf377946
record_format Article
timestamp 2020-11-25T01:13:24Z
volume 41
issue 1-2
pages 84-94
doi 10.5281/zenodo.3240216
affiliation Zaporozhye National Technical University, Zaporozhye, Ukraine (both authors)
collection DOAJ
language English
format Article
sources DOAJ
author Golub T.V.
Tiahunova M. Yu.
title The Method for Reducing the Term Vector Size for Category Classification of Text Documents
publisher Academy of Sciences of Moldova
series Problems of the Regional Energetics
issn 1857-0070
publishDate 2019-06-01
description The article proposes a method for reducing the time needed to subsume a document (assign it to a category) during text classification by reducing the term vector size of individual categories. At the training stage, term weight factors were calculated for each classification category. After analysis of the obtained data, the category terms whose weights did not exceed an experimentally determined threshold were excluded from that category's term vector by setting them to zero, so they took no part in the subsuming process at the testing stage. The experimental part used the TF-SLF reference method and CTFSLF, its modification according to the procedure described above. Applying the proposed method decreased the differential term vector size for each category. Although compiling the per-category term vectors took longer, this step was performed only once, and the time needed to determine whether a document belonged to a specific category decreased without loss of classification quality. In addition, because the proposed method excluded words that occurred frequently in the texts, the stop-word removal stage could be dropped from the preprocessing of the analyzed text. For the same reason, the problem of misprints and of words "stuck together" in the initial training sample was solved.
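The pruning step described above can be illustrated with a minimal sketch. It does not reproduce the TF-SLF or CTFSLF formulas, which this record does not give. As a labeled stand-in, it weights each term by its relative frequency in a category multiplied by log(number of categories / number of categories containing the term). The function names, the toy training data, and the threshold value are all hypothetical.

```python
import math
from collections import Counter


def category_term_weights(docs_by_category):
    """Return {category: {term: weight}}.

    ASSUMPTION: the weight is a stand-in (relative term frequency times a
    log inverse category frequency), not the TF-SLF weight from the article.
    """
    tf = {
        cat: Counter(tok for doc in docs for tok in doc.lower().split())
        for cat, docs in docs_by_category.items()
    }
    n_cats = len(tf)
    # For each term, count how many categories contain it at least once.
    cats_with_term = Counter(term for counts in tf.values() for term in counts)
    weights = {}
    for cat, counts in tf.items():
        total = sum(counts.values())
        weights[cat] = {
            term: (n / total) * math.log(n_cats / cats_with_term[term])
            for term, n in counts.items()
        }
    return weights


def prune(weights, threshold):
    """Keep only terms whose weight exceeds the threshold.

    Dropping a term has the same effect as setting its weight to zero.
    """
    return {
        cat: {term: w for term, w in vec.items() if w > threshold}
        for cat, vec in weights.items()
    }


def classify(text, term_vectors):
    """Assign the text to the category with the largest summed term weight."""
    tokens = text.lower().split()
    scores = {
        cat: sum(vec.get(tok, 0.0) for tok in tokens)
        for cat, vec in term_vectors.items()
    }
    return max(scores, key=scores.get)


# Hypothetical toy training sample; no stop-word removal is applied.
training = {
    "sport": ["the team won the match", "the coach praised the team"],
    "energy": ["the grid supplies power", "the plant feeds the grid"],
}
weights = category_term_weights(training)  # computed once, at training time
pruned = prune(weights, threshold=0.0)     # threshold chosen arbitrarily here
label = classify("the grid lost power", pruned)
```

In this toy sample the word "the" occurs in every category, so its stand-in weight is zero and pruning removes it without a separate stop-word list. Each category vector becomes shorter, which is the source of the per-document time saving the description reports.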
topic text classification
stemming
url http://journal.ie.asm.md/assets/files/09_12_41_2019.pdf