Feature Selection in Data Discretization

Master's thesis === National Central University === Department of Information Management === Academic year 106 === In reality, data are rarely as "clean" as we assume, so data quality must be assessed and ensured through pre-processing. Several problems must be solved: high-dimensional data may include irrelevant and redundant features (attributes of the data), and data may contain many continuous attributes that are hard to understand and explain. Using such "unclean" data can dramatically degrade a model's predictive performance. Previous research shows that discretization reduces and simplifies the data, making model learning faster and yielding more accurate, compact, and shorter results, while reducing noise that may be present in the data; this helps avoid overfitting and smooths the data. In addition, feature selection is a common pre-processing method: it reduces the time complexity of model training and identifies important features, improving the model's classification accuracy. Few studies so far have examined pre-processing that combines discretization and feature selection. This thesis therefore focuses on finding the optimal combination of discretization and feature selection for data pre-processing. The experiments use three popular feature selection methods — GA (genetic algorithm), DT (decision tree), and PCA (principal component analysis) — and four discretization methods: EWD (equal-width discretization), EFD (equal-frequency discretization), MDLP (minimum description length principle), and ChiMerge. To explore the optimal combination, data are collected from 10 UCI datasets, with dimensionality ranging from 8 to 90 features and classification problems containing 2 to 28 classes. Comparative results are based on the average accuracy of C5.0 and SVM classifiers. The empirical results show that the MDLP discretization method gives the best predictive performance, and that performing feature selection before discretization yields higher classifier accuracy than discretization alone. Moreover, regardless of the classifier used (C5.0 or SVM), feature selection by C4.5 followed by discretization by MDLP is the combination recommended in this thesis, reaching an average classification accuracy of 80.1%.

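The two unsupervised discretization methods compared in the thesis, EWD and EFD, can be illustrated with a minimal pure-Python sketch. The function names and this particular tie-handling are illustrative choices, not taken from the thesis itself:

```python
def equal_width_bins(values, k):
    """Equal-width discretization (EWD): split [min, max] into k
    intervals of identical width and map each value to its bin index."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    def bin_of(v):
        if v == hi:                      # put the maximum into the last bin
            return k - 1
        return int((v - lo) / width)
    return [bin_of(v) for v in values]

def equal_frequency_bins(values, k):
    """Equal-frequency discretization (EFD): choose cut points so that
    each bin receives roughly the same number of observations."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    n = len(values)
    bins = [0] * n
    for rank, i in enumerate(order):
        bins[i] = min(rank * k // n, k - 1)
    return bins
```

On skewed data the two methods differ: with `[1, 1, 1, 1, 100]` and two bins, EWD puts the four small values in bin 0 and the outlier alone in bin 1, while EFD splits the observations evenly across the bins.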

Bibliographic Details
Main Authors: Yu-Chi CHEN, 陳鈺錡
Other Authors: Chih-Fong Tsai
Format: Others
Language: zh-TW
Published: 2018
Online Access: http://ndltd.ncl.edu.tw/handle/k2kf74
id ndltd-TW-106NCU05396042
record_format oai_dc
spelling ndltd-TW-106NCU053960422019-10-31T05:22:23Z http://ndltd.ncl.edu.tw/handle/k2kf74 Feature Selection in Data Discretization 特徵選取於資料離散化之影響 Yu-Chi CHEN 陳鈺錡 Chih-Fong Tsai Kuen-Liang Sue 蔡志豐 蘇坤良 2018 學位論文 ; thesis 95 zh-TW
collection NDLTD
language zh-TW
format Others
sources NDLTD
description Master's thesis === National Central University === Department of Information Management === Academic year 106 === In reality, data are rarely as "clean" as we assume, so data quality must be assessed and ensured through pre-processing. Several problems must be solved: high-dimensional data may include irrelevant and redundant features (attributes of the data), and data may contain many continuous attributes that are hard to understand and explain. Using such "unclean" data can dramatically degrade a model's predictive performance. Previous research shows that discretization reduces and simplifies the data, making model learning faster and yielding more accurate, compact, and shorter results, while reducing noise that may be present in the data; this helps avoid overfitting and smooths the data. In addition, feature selection is a common pre-processing method: it reduces the time complexity of model training and identifies important features, improving the model's classification accuracy. Few studies so far have examined pre-processing that combines discretization and feature selection. This thesis therefore focuses on finding the optimal combination of discretization and feature selection for data pre-processing. The experiments use three popular feature selection methods — GA (genetic algorithm), DT (decision tree), and PCA (principal component analysis) — and four discretization methods: EWD (equal-width discretization), EFD (equal-frequency discretization), MDLP (minimum description length principle), and ChiMerge. To explore the optimal combination, data are collected from 10 UCI datasets, with dimensionality ranging from 8 to 90 features and classification problems containing 2 to 28 classes. Comparative results are based on the average accuracy of C5.0 and SVM classifiers. The empirical results show that the MDLP discretization method gives the best predictive performance, and that performing feature selection before discretization yields higher classifier accuracy than discretization alone. Moreover, regardless of the classifier used (C5.0 or SVM), feature selection by C4.5 followed by discretization by MDLP is the combination recommended in this thesis, reaching an average classification accuracy of 80.1%.
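The abstract's main recommendation is an ordering: apply feature selection first, then discretize the surviving columns. The sketch below shows that pipeline order with deliberately simplified stand-ins — a toy mean-separation filter instead of the GA/C4.5/PCA selectors the thesis actually uses, and equal-width binning as the discretizer; all function names are hypothetical:

```python
def select_features(rows, labels, k):
    """Toy filter-style feature selection: rank features by how far apart
    their class means are (binary 0/1 labels assumed), keep the top k.
    The thesis uses GA, C4.5 decision trees, and PCA instead; this score
    is only a stand-in to demonstrate the pipeline order."""
    n_feat = len(rows[0])
    def score(j):
        pos = [r[j] for r, y in zip(rows, labels) if y == 1]
        neg = [r[j] for r, y in zip(rows, labels) if y == 0]
        return abs(sum(pos) / len(pos) - sum(neg) / len(neg))
    ranked = sorted(range(n_feat), key=score, reverse=True)
    return sorted(ranked[:k])          # indices of the kept features

def discretize_equal_width(rows, n_bins):
    """Equal-width binning applied column by column."""
    binned_cols = []
    for col in zip(*rows):
        lo, hi = min(col), max(col)
        width = (hi - lo) / n_bins or 1.0   # guard constant columns
        binned_cols.append([min(int((v - lo) / width), n_bins - 1) for v in col])
    return [list(t) for t in zip(*binned_cols)]

def preprocess(rows, labels, k, n_bins):
    """Recommended order from the thesis: feature selection FIRST,
    discretization SECOND."""
    keep = select_features(rows, labels, k)
    reduced = [[r[j] for j in keep] for r in rows]
    return discretize_equal_width(reduced, n_bins)
```

The key design point is simply that `discretize_equal_width` only ever sees the columns that survived selection, so no effort is spent computing cut points for features that are about to be discarded.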
author2 Chih-Fong Tsai
author_facet Chih-Fong Tsai
Yu-Chi CHEN
陳鈺錡
author Yu-Chi CHEN
陳鈺錡
spellingShingle Yu-Chi CHEN
陳鈺錡
Feature Selection in Data Discretization
author_sort Yu-Chi CHEN
title Feature Selection in Data Discretization
title_short Feature Selection in Data Discretization
title_full Feature Selection in Data Discretization
title_fullStr Feature Selection in Data Discretization
title_full_unstemmed Feature Selection in Data Discretization
title_sort feature selection in data discretization
publishDate 2018
url http://ndltd.ncl.edu.tw/handle/k2kf74
work_keys_str_mv AT yuchichen featureselectionindatadiscretization
AT chényùqí featureselectionindatadiscretization
AT yuchichen tèzhēngxuǎnqǔyúzīliàolísànhuàzhīyǐngxiǎng
AT chényùqí tèzhēngxuǎnqǔyúzīliàolísànhuàzhīyǐngxiǎng
_version_ 1719284373689204736