The Method for Reducing the Term Vector Size for Category Classification of Text Documents
The article proposes a method for reducing the time needed to assign a document to a category during text classification by reducing the size of the term vector of each category. According to the method, the term weight factors were calculated for each classification category to implement su...
Main Authors: | Golub T.V., Tiahunova M. Yu. |
---|---|
Format: | Article |
Language: | English |
Published: | Academy of Sciences of Moldova 2019-06-01 |
Series: | Problems of the Regional Energetics |
Subjects: | |
Online Access: | http://journal.ie.asm.md/assets/files/09_12_41_2019.pdf |
id |
doaj-ca95a91462a7421da48cbddfbf377946 |
---|---|
record_format |
Article |
spelling |
doaj-ca95a91462a7421da48cbddfbf377946 2020-11-25T01:13:24Z eng Academy of Sciences of Moldova Problems of the Regional Energetics 1857-0070 2019-06-01 41 1-2 84 94 10.5281/zenodo.3240216 The Method for Reducing the Term Vector Size for Category Classification of Text Documents Golub T.V. 0 Tiahunova M. Yu. 1 Zaporozhye National Technical University, Zaporozhye, Ukraine Zaporozhye National Technical University, Zaporozhye, Ukraine The article proposes a method for reducing the time needed to assign a document to a category during text classification by reducing the size of the term vector of each category. According to the method, term weight factors were calculated for each classification category at the training stage. Based on an analysis of the resulting data, the terms of a category whose weights did not exceed an experimentally determined threshold were excluded from that category's term vector by setting their weights to zero. These terms took no part in the assignment process at the testing stage. The experimental part used the TF-SLF reference method and its modification, CTFSLF, built as described above. Applying the proposed method decreased the differential term vector size for each category. Although compiling the term vectors by category, which was performed only once, took longer, the time needed to determine whether a document belonged to a specific category decreased without loss of classification quality. In addition, because the proposed method excluded words that occurred frequently in the texts, the stop-word removal stage could be dropped from the preprocessing of the analyzed text. For the same reason, the problem of misprints and of words "stuck together" in the initial training sample was solved. http://journal.ie.asm.md/assets/files/09_12_41_2019.pdf text classification a stemming |
collection |
DOAJ |
language |
English |
format |
Article |
sources |
DOAJ |
author |
Golub T.V. Tiahunova M. Yu. |
author_sort |
Golub T.V. |
title |
The Method for Reducing the Term Vector Size for Category Classification of Text Documents |
title_sort |
method for reducing the term vector size for category classification of text documents |
publisher |
Academy of Sciences of Moldova |
series |
Problems of the Regional Energetics |
issn |
1857-0070 |
publishDate |
2019-06-01 |
description |
The article proposes a method for reducing the time needed to assign a document to a category during text classification by reducing the size of the term vector of each category. According to the method, term weight factors were calculated for each classification category at the training stage. Based on an analysis of the resulting data, the terms of a category whose weights did not exceed an experimentally determined threshold were excluded from that category's term vector by setting their weights to zero. These terms took no part in the assignment process at the testing stage. The experimental part used the TF-SLF reference method and its modification, CTFSLF, built as described above. Applying the proposed method decreased the differential term vector size for each category. Although compiling the term vectors by category, which was performed only once, took longer, the time needed to determine whether a document belonged to a specific category decreased without loss of classification quality. In addition, because the proposed method excluded words that occurred frequently in the texts, the stop-word removal stage could be dropped from the preprocessing of the analyzed text. For the same reason, the problem of misprints and of words "stuck together" in the initial training sample was solved. |
topic |
text classification a stemming |
url |
http://journal.ie.asm.md/assets/files/09_12_41_2019.pdf |
_version_ |
1725162635978080256 |
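
The pruning step described in the abstract (compute per-category term weights at training time, zero out the terms whose weight does not exceed a threshold, then score documents against the shortened vectors) can be sketched as follows. This is only an illustration under stated assumptions: the record does not give the TF-SLF or CTFSLF formulas, so the weighting below is a simple stand-in (relative term frequency scaled by the log of how few categories contain the term), and the function names, the sample data, and the threshold of 0.0 are all hypothetical.

```python
import math
from collections import Counter


def category_term_weights(training_docs):
    """Stand-in weighting (NOT TF-SLF): relative term frequency within a
    category, scaled by log(number of categories / categories containing the term).
    training_docs maps a category name to a list of tokenized documents."""
    tf = {}
    for category, docs in training_docs.items():
        counts = Counter(token for doc in docs for token in doc)
        total = sum(counts.values()) or 1  # guard against an empty category
        tf[category] = {term: n / total for term, n in counts.items()}
    # In how many categories does each term occur?
    spread = Counter(term for weights in tf.values() for term in weights)
    n_categories = len(tf)
    return {
        category: {
            term: freq * math.log(n_categories / spread[term])
            for term, freq in weights.items()
        }
        for category, weights in tf.items()
    }


def prune(weights, threshold):
    """Keep only terms whose weight exceeds the threshold. Dropping a term
    is equivalent to setting its weight to zero in the category vector."""
    return {
        category: {t: w for t, w in term_weights.items() if w > threshold}
        for category, term_weights in weights.items()
    }


def classify(tokens, pruned):
    """Assign the category whose pruned vector gives the highest summed weight."""
    def score(category):
        return sum(pruned[category].get(tok, 0.0) for tok in tokens)
    return max(pruned, key=score)


# Hypothetical training sample; "the" occurs in every category.
training = {
    "energy": [["the", "grid", "power", "the"], ["power", "turbine", "the"]],
    "texts": [["the", "stemming", "corpus"], ["the", "corpus", "token"]],
}
weights = category_term_weights(training)
pruned = prune(weights, threshold=0.0)  # hypothetical threshold
label = classify(["the", "power", "grid"], pruned)
print(sorted(pruned["energy"]), label)
```

With this stand-in weighting, a word that occurs in every category receives weight zero and falls out of every pruned vector, which mirrors the abstract's remark that a separate stop-word removal stage becomes unnecessary. The threshold in the article was determined experimentally; the value used here is arbitrary.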