Feature Selection Method Based On Term Frequency, Location And Category Relations
Chinese Title: 基於詞頻、位置及類別關係的特徵選擇方法
Degree: Master's, Ming Chuan University, Master's Program in Information Management, academic year 106
Main Authors: ZHENG, KAI-YUAN (鄭開元)
Other Authors: TING, MING-YUNG (丁明勇); LEE, YUE-SHI (李御璽)
Format: Others (學位論文; thesis)
Language: zh-TW
Published: 2018
Online Access: http://ndltd.ncl.edu.tw/handle/n7n4j2
id: ndltd-TW-106MCU00396006
Description:
With the advent of the era of big data, how to analyze and mine data has become an important topic. Text mining, which focuses on the analysis of text data, is an important part of data analysis; it helps people obtain the information in a text more quickly and effectively.
As an important branch of text mining, text classification is the process of assigning texts to predefined categories with an algorithm under a given category system. It is widely used for the rapid classification of press and publication products, web page classification, personalized news recommendation, spam filtering, user analysis, and so on. A typical Chinese text classification workflow is divided into several steps, as sketched below: text preprocessing, feature selection and construction of a word vector matrix, classifier construction and testing, and classifier performance evaluation.
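A minimal sketch of this four-step workflow is shown here, assuming jieba for Chinese word segmentation and scikit-learn for vectorization, feature selection, classification, and evaluation; the abstract does not prescribe these tools, so they are illustrative choices only.

```python
# Sketch of the four steps named above; library choices are assumptions.
import jieba
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

def preprocess(doc):
    # Step 1: text preprocessing -- segment Chinese text into terms.
    return " ".join(jieba.cut(doc))

def run_pipeline(train_docs, train_labels, test_docs, test_labels, k=1000):
    # Step 2: build the word vector matrix and select k features
    # (plain Chi-square selection here, as a baseline).
    vectorizer = CountVectorizer(token_pattern=r"(?u)\S+")  # keep single-character terms
    X_train = vectorizer.fit_transform(preprocess(d) for d in train_docs)
    X_test = vectorizer.transform(preprocess(d) for d in test_docs)
    selector = SelectKBest(chi2, k=min(k, X_train.shape[1]))
    X_train = selector.fit_transform(X_train, train_labels)
    X_test = selector.transform(X_test)

    # Step 3: construct and test the classifier.
    clf = MultinomialNB().fit(X_train, train_labels)
    predictions = clf.predict(X_test)

    # Step 4: evaluate performance (precision, recall, F1 per category).
    print(classification_report(test_labels, predictions))
```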
Because the feature word set obtained after preprocessing is usually very large, feature selection is needed to reduce its dimensionality and avoid problems such as inefficiency and the 'curse of dimensionality'. Moreover, the quality of the feature selection method directly affects the subsequent classification performance, so the improvement of existing feature selection methods deserves further study and discussion.
To address these shortcomings of feature selection, this thesis introduces five factors: the importance of term location, inter-class term frequency, intra-class term frequency, inter-class concentration, and intra-class dispersion. The Chi-square and cross-entropy methods are improved with these factors, and a Chinese text classification algorithm based on multiple factors is proposed.
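The abstract names these factors but not their exact formulas, so the following sketch is only a hypothetical illustration of how factors such as term location importance and inter-class concentration could modulate a base Chi-square score; it is not the thesis' actual weighting scheme.

```python
# Hypothetical multi-factor score: the concrete combination used in the
# thesis is not given in the abstract, so this is an assumption for illustration.
import numpy as np

def multifactor_scores(class_tf, chi2_scores, location_weight):
    """class_tf: (n_classes, n_terms) per-class term frequencies.
    chi2_scores: (n_terms,) standard Chi-square scores per term.
    location_weight: (n_terms,) weights for where terms tend to occur
    (e.g. larger when a term appears in titles or opening paragraphs)."""
    share = class_tf / (class_tf.sum(axis=0, keepdims=True) + 1e-12)
    # Inter-class concentration: 1 minus the normalized entropy of a term's
    # distribution over classes -- near 1 when the term is confined to a few
    # classes, near 0 when it is spread evenly across all classes.
    entropy = -(share * np.log(share + 1e-12)).sum(axis=0)
    concentration = 1.0 - entropy / np.log(class_tf.shape[0])
    # Combine the factors multiplicatively (one of many possible choices).
    return chi2_scores * location_weight * concentration
```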
In comparisons of per-category F1 values, the improved method outperforms the other methods, and it is more stable when classifying unbalanced document sets. On both balanced and unbalanced document datasets, the feature selection method proposed in this study shows a significant improvement over traditional methods and the other compared methods.
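For reference, per-category and macro-averaged F1 can be computed as below; scikit-learn is an assumed tool here, and the labels are toy examples rather than the datasets used in the thesis.

```python
from sklearn.metrics import f1_score

# Toy example: three categories, a few predictions.
y_true = ["sports", "finance", "sports", "tech", "finance", "tech"]
y_pred = ["sports", "finance", "tech",   "tech", "sports",  "tech"]

print(f1_score(y_true, y_pred, average=None,
               labels=["sports", "finance", "tech"]))  # one F1 per category
print(f1_score(y_true, y_pred, average="macro"))       # unweighted mean over categories
```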