Summary: | Master's === Ming Chuan University === Master's Program, Department of Information Management === 106 === With the advent of the big data era, how to analyze and mine data has become an important topic. Text mining is the part of data analysis that focuses on text data. It helps people extract the information in text more quickly and effectively.
As an important branch of text mining, text classification is the process of using an algorithm to assign texts to specific categories within a given classification system. It is widely used for rapid classification of news and publishing products, web page classification, personalized intelligent news recommendation, spam filtering, user analysis, and similar applications. Chinese text classification is generally divided into several steps: text preprocessing, feature selection and construction of the word vector matrix, classifier construction and testing, and classifier performance evaluation.
Once text preprocessing has produced a feature word set, feature selection is usually needed to reduce its dimensionality and avoid problems such as inefficiency and the curse of dimensionality. The choice of feature selection method directly affects subsequent classification performance. Therefore, improving existing feature selection methods deserves further study and discussion.
For this reason, this study addresses the shortcomings of existing feature selection methods by introducing five factors: the importance of term location, inter-class term frequency, intra-class term frequency, the degree of inter-class concentration, and the degree of intra-class dispersion. The chi-square statistic and cross-entropy are improved with these factors, and a Chinese text classification algorithm based on multiple factors is proposed.
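For orientation, the sketch below computes the classic, unmodified chi-square statistic that serves as the baseline for this kind of feature selection. It is not the improved method of this study, whose formulas are not given in this summary. The function names `chi_square_scores` and `select_features`, the max-over-categories ranking rule, and the assumption that documents arrive already tokenized are all illustrative choices.

```python
from collections import Counter


def chi_square_scores(docs, labels):
    """Return {(term, category): chi2} for already-tokenized documents.

    Classic 2x2 contingency-table statistic (baseline, not the improved variant):
        chi2 = N * (A*D - B*C)^2 / ((A+C) * (B+D) * (A+B) * (C+D))
    A: docs in the category that contain the term
    B: docs outside the category that contain the term
    C: docs in the category that lack the term
    D: docs outside the category that lack the term
    """
    n = len(docs)
    cat_count = Counter(labels)   # documents per category
    term_count = Counter()        # documents containing each term
    term_cat_count = Counter()    # documents containing term, per category
    for tokens, label in zip(docs, labels):
        for term in set(tokens):  # document frequency, so count each term once
            term_count[term] += 1
            term_cat_count[(term, label)] += 1

    scores = {}
    for term in term_count:
        for cat in cat_count:
            a = term_cat_count[(term, cat)]
            b = term_count[term] - a
            c = cat_count[cat] - a
            d = n - a - b - c
            denom = (a + c) * (b + d) * (a + b) * (c + d)
            # A zero denominator means the term or category carries no contrast.
            scores[(term, cat)] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


def select_features(docs, labels, k):
    """Keep the k terms with the highest chi-square score over any category."""
    best = {}
    for (term, _cat), score in chi_square_scores(docs, labels).items():
        best[term] = max(best.get(term, 0.0), score)
    # Ties are broken alphabetically so the result is deterministic.
    return sorted(best, key=lambda t: (-best[t], t))[:k]
```

The baseline uses only document counts. It ignores where a term appears and how often it occurs within a document, which is the kind of information the factors listed above are meant to supply.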
In comparisons of per-category F1 values, the improved method of this study outperforms the other methods, and it is more stable than they are when classifying unbalanced document sets. On both balanced and unbalanced document datasets, the feature selection method proposed in this study shows a significant improvement over traditional methods and other methods.
|