Feature Selection Method Based On Term Frequency, Location And Category Relations

Master's === Ming Chuan University (銘傳大學) === Master's Program, Department of Information Management === 106 === With the advent of big data, analyzing and mining data has become an important topic. Text mining is the part of data analysis that focuses on text data. It helps people extract the information in a text more quickly...


Bibliographic Details
Main Authors: ZHENG, KAI-YUAN, 鄭開元
Other Authors: TING, MING-YUNG
Format: Others
Language: zh-TW
Published: 2018
Online Access: http://ndltd.ncl.edu.tw/handle/n7n4j2
id ndltd-TW-106MCU00396006
record_format oai_dc
spelling ndltd-TW-106MCU00396006 2019-10-03T03:40:45Z http://ndltd.ncl.edu.tw/handle/n7n4j2
Feature Selection Method Based On Term Frequency, Location And Category Relations (基於詞頻、位置及類別關係的特徵選擇方法)
ZHENG, KAI-YUAN 鄭開元
Master's (碩士), Ming Chuan University (銘傳大學), Master's Program, Department of Information Management (資訊管理學系碩士班), 106
With the advent of big data, analyzing and mining data has become an important topic. Text mining is the part of data analysis that focuses on text data; it helps people extract the information in a text more quickly and effectively. Text classification, an important branch of text mining, is the process of using an algorithm to assign texts to categories within a given classification system. It is widely used for rapid classification of press and publication products, web page classification, personalized intelligent news recommendation, spam filtering, user analysis, and similar applications.
Chinese text classification is generally divided into several steps: text preprocessing, feature selection and construction of a word vector matrix, construction and testing of the classifier, and evaluation of classifier performance. The feature word set produced by preprocessing usually has to be reduced in dimension through feature selection, to avoid problems such as inefficiency and the curse of dimensionality ('dimensional disaster'). The quality of the feature selection method directly affects the subsequent classification results, so improving existing feature selection methods deserves further study.
For this reason, this thesis addresses the shortcomings of existing feature selection methods by introducing the importance of term location, inter-category term frequency relationships, intra-class term frequency relationships, the degree of inter-class concentration, and the degree of intra-class dispersion. The chi-square and cross-entropy methods are improved, and a Chinese text classification algorithm based on multiple factors is proposed.
The improved method achieves better per-category F1 values than the other methods compared, and it is more stable when classifying unbalanced document sets. On both balanced and unbalanced document datasets, the proposed feature selection method shows a significant improvement over traditional methods and the other methods compared.
TING, MING-YUNG 丁明勇; LEE, YUE-SHI 李御璽
2018; 學位論文 (thesis); 62; zh-TW
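For orientation, the chi-square criterion that the abstract takes as a starting point can be sketched in its classical, document-frequency form. This is only the textbook baseline, not the improved method of the thesis: the record does not give the formulas for the term location, term frequency, concentration, or dispersion factors, so none of them appear here. The function names and the toy corpus are hypothetical and chosen purely for illustration.

```python
from collections import Counter


def chi_square(a, b, c, d):
    """Classical chi-square statistic for one term/category 2x2 table.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def chi_square_scores(docs, labels):
    """Return {(term, category): score} for tokenized docs and their labels."""
    n_docs = len(docs)
    cat_sizes = Counter(labels)
    df = Counter()      # number of documents containing each term
    df_cat = Counter()  # same count, split by category
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            df[term] += 1
            df_cat[(term, label)] += 1
    scores = {}
    for term in df:
        for cat, size in cat_sizes.items():
            a = df_cat[(term, cat)]
            b = df[term] - a
            c = size - a
            d = n_docs - size - b
            scores[(term, cat)] = chi_square(a, b, c, d)
    return scores


def select_features(docs, labels, k):
    """Keep the k terms with the highest score over all categories."""
    best = {}
    for (term, _cat), score in chi_square_scores(docs, labels).items():
        best[term] = max(best.get(term, 0.0), score)
    # Ties are broken alphabetically so the result is deterministic.
    return sorted(best, key=lambda t: (-best[t], t))[:k]


# Hypothetical toy corpus (already tokenized), for illustration only.
docs = [
    ["ball", "team", "win"],
    ["team", "score"],
    ["stock", "market"],
    ["market", "price", "win"],
]
labels = ["sports", "sports", "finance", "finance"]
top_terms = select_features(docs, labels, 2)
```

In this toy corpus, "win" occurs equally often in both categories and scores zero, while "team" and "market" each occur in every document of exactly one category and rank highest. The classical statistic uses only document counts, which is the kind of limitation the abstract says the additional factors are meant to address.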
collection NDLTD
language zh-TW
format Others
sources NDLTD
author2 TING, MING-YUNG
author_facet TING, MING-YUNG
ZHENG, KAI-YUAN
鄭開元
author ZHENG, KAI-YUAN
鄭開元
title Feature Selection Method Based On Term Frequency, Location And Category Relations
publishDate 2018
url http://ndltd.ncl.edu.tw/handle/n7n4j2