The Method for Reducing the Term Vector Size for Category Classification of Text Documents


Bibliographic Details
Main Authors: Golub T. V., Tiahunova M. Yu.
Format: Article
Language: English
Published: Academy of Sciences of Moldova 2019-06-01
Series: Problems of the Regional Energetics
Subjects:
Online Access: http://journal.ie.asm.md/assets/files/09_12_41_2019.pdf
Description
Summary: The article proposes a method for reducing the time needed to subsume a document during text classification by reducing the term vector size of individual categories. According to the method, term weight factors are calculated for each classification category during system training to support the subsuming process. Based on the analysis of the obtained data, the category terms whose weight values did not exceed an experimentally determined threshold were excluded from the category's term vector by setting them to zero, so they took no further part in the subsuming process at the testing stage. The reference TF-SLF method and its modification CTF-SLF, adapted as described above, served as the input for the experimental part. Applying the proposed method reduced the term vector size, by a different amount for each category. Although compiling the per-category term vectors takes longer, this step is performed only once, while the time needed to determine whether a document belongs to a specific category decreased without loss of classification quality. In addition, because the proposed method excludes words that occur frequently in the texts, the stop-word removal stage can be dropped from the preprocessing of the analyzed text. For the same reason, the problem of misprints and words "stuck together" in the initial training sample was solved.
ISSN:1857-0070
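
Note: The abstract does not reproduce the TF-SLF or CTF-SLF formulas, so the following is only a minimal sketch of the pruning idea it describes: per-category term weights are computed once at training time, terms whose weight does not exceed a threshold are dropped from the category vector, and only the surviving terms take part in scoring at the testing stage. The weighting used here (a term's in-category frequency divided by its corpus-wide frequency, so terms spread evenly across categories, such as stop words, score low) is an illustrative stand-in, and all names and the threshold value are assumptions, not the article's implementation.

    from collections import Counter

    def category_term_weights(category_docs):
        """Stand-in per-category term weights (not the article's TF-SLF).

        category_docs -- dict: category name -> list of tokenized documents.
        A term's weight in a category is its frequency inside the category
        divided by its frequency over the whole corpus, so terms common to
        all categories (e.g. stop words) receive low weights.
        """
        per_cat = {c: Counter(t for doc in docs for t in doc)
                   for c, docs in category_docs.items()}
        corpus = Counter()
        for counts in per_cat.values():
            corpus.update(counts)
        return {c: {t: n / corpus[t] for t, n in counts.items()}
                for c, counts in per_cat.items()}

    def prune(weights, threshold):
        """Drop terms whose weight does not exceed the threshold."""
        return {t: w for t, w in weights.items() if w > threshold}

    def score(doc_tokens, vector):
        """Score a test document against a pruned category vector."""
        return sum(vector.get(t, 0.0) for t in doc_tokens)

    # Hypothetical usage: "the" occurs equally in both categories,
    # gets weight 0.5 in each, and is pruned at threshold 0.6.
    train = {
        "energy": [["the", "power", "grid"], ["the", "grid", "voltage"]],
        "text":   [["the", "term", "vector"], ["the", "term", "weight"]],
    }
    vectors = {c: prune(w, threshold=0.6)
               for c, w in category_term_weights(train).items()}
    doc = ["grid", "voltage", "the"]
    best = max(vectors, key=lambda c: score(doc, vectors[c]))
    print(best)  # -> "energy"

Because the pruned terms are simply absent from each category's vector, scoring a test document touches only the surviving terms, which is the source of the reduced classification time the abstract reports.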