Feature Selection in Data Discretization

Master's thesis === National Central University === Department of Information Management === Academic year 106 === In reality, data are rarely as "clean" as we assume, so data quality must be assessed and ensured through pre-processing. Several problems must be solved: high-dimensional data may include irrelevant and redundant features (attributes of the data), and data may contain many continuous attributes that are hard to understand and explain. Using such "unclean" data can dramatically degrade a model's predictive performance. Previous research shows that discretization reduces and simplifies the data, making model learning faster and yielding more accurate, compact, and shorter results, while reducing noise that may be present in the data; this helps avoid overfitting and smooths the data. In addition, feature selection is a common pre-processing method: it reduces the time complexity of model training and identifies important features, improving the model's classification accuracy. Few studies so far have examined pre-processing that combines discretization and feature selection. This thesis therefore focuses on finding the optimal combination of discretization and feature selection for data pre-processing. The experiments use three popular feature selection methods — GA (genetic algorithm), DT (decision tree), and PCA (principal component analysis) — and four discretization methods: EWD (equal-width discretization), EFD (equal-frequency discretization), MDLP (minimum description length principle), and ChiMerge. To explore the optimal combination, data are collected from 10 UCI datasets, with dimensionality ranging from 8 to 90 features and classification problems containing 2 to 28 classes. Comparative results are based on the average accuracy of C5.0 and SVM classifiers. The empirical results show that the MDLP discretization method gives the best predictive performance, and that performing feature selection before discretization yields higher classifier accuracy than discretization alone. Moreover, regardless of the classifier used (C5.0 or SVM), feature selection by C4.5 followed by discretization by MDLP is the combination recommended in this thesis, reaching an average classification accuracy of 80.1%.

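The two unsupervised discretization methods compared in the thesis, EWD and EFD, can be illustrated with a minimal pure-Python sketch. The function names and this particular tie-handling are illustrative choices, not taken from the thesis itself:

```python
def equal_width_bins(values, k):
    """Equal-width discretization (EWD): split [min, max] into k
    intervals of identical width and map each value to its bin index."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    def bin_of(v):
        if v == hi:                      # put the maximum into the last bin
            return k - 1
        return int((v - lo) / width)
    return [bin_of(v) for v in values]

def equal_frequency_bins(values, k):
    """Equal-frequency discretization (EFD): choose cut points so that
    each bin receives roughly the same number of observations."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    n = len(values)
    bins = [0] * n
    for rank, i in enumerate(order):
        bins[i] = min(rank * k // n, k - 1)
    return bins
```

On skewed data the two methods differ: with `[1, 1, 1, 1, 100]` and two bins, EWD puts the four small values in bin 0 and the outlier alone in bin 1, while EFD splits the observations evenly across the bins.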

Bibliographic Details
Main Authors: Yu-Chi CHEN, 陳鈺錡
Other Authors: Chih-Fong Tsai
Format: Others
Language: zh-TW
Published: 2018
Online Access: http://ndltd.ncl.edu.tw/handle/k2kf74
id ndltd-TW-106NCU05396042
record_format oai_dc
spelling ndltd-TW-106NCU053960422019-10-31T05:22:23Z http://ndltd.ncl.edu.tw/handle/k2kf74 Feature Selection in Data Discretization 特徵選取於資料離散化之影響 Yu-Chi CHEN 陳鈺錡 Chih-Fong Tsai Kuen-Liang Sue 蔡志豐 蘇坤良 2018 學位論文 ; thesis 95 zh-TW
collection NDLTD
language zh-TW
format Others
sources NDLTD
description Master's thesis === National Central University === Department of Information Management === Academic year 106 === In reality, data are rarely as "clean" as we assume, so data quality must be assessed and ensured through pre-processing. Several problems must be solved: high-dimensional data may include irrelevant and redundant features (attributes of the data), and data may contain many continuous attributes that are hard to understand and explain. Using such "unclean" data can dramatically degrade a model's predictive performance. Previous research shows that discretization reduces and simplifies the data, making model learning faster and yielding more accurate, compact, and shorter results, while reducing noise that may be present in the data; this helps avoid overfitting and smooths the data. In addition, feature selection is a common pre-processing method: it reduces the time complexity of model training and identifies important features, improving the model's classification accuracy. Few studies so far have examined pre-processing that combines discretization and feature selection. This thesis therefore focuses on finding the optimal combination of discretization and feature selection for data pre-processing. The experiments use three popular feature selection methods — GA (genetic algorithm), DT (decision tree), and PCA (principal component analysis) — and four discretization methods: EWD (equal-width discretization), EFD (equal-frequency discretization), MDLP (minimum description length principle), and ChiMerge. To explore the optimal combination, data are collected from 10 UCI datasets, with dimensionality ranging from 8 to 90 features and classification problems containing 2 to 28 classes. Comparative results are based on the average accuracy of C5.0 and SVM classifiers. The empirical results show that the MDLP discretization method gives the best predictive performance, and that performing feature selection before discretization yields higher classifier accuracy than discretization alone. Moreover, regardless of the classifier used (C5.0 or SVM), feature selection by C4.5 followed by discretization by MDLP is the combination recommended in this thesis, reaching an average classification accuracy of 80.1%.
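The abstract's main recommendation is an ordering: apply feature selection first, then discretize the surviving columns. The sketch below shows that pipeline order with deliberately simplified stand-ins — a toy mean-separation filter instead of the GA/C4.5/PCA selectors the thesis actually uses, and equal-width binning as the discretizer; all function names are hypothetical:

```python
def select_features(rows, labels, k):
    """Toy filter-style feature selection: rank features by how far apart
    their class means are (binary 0/1 labels assumed), keep the top k.
    The thesis uses GA, C4.5 decision trees, and PCA instead; this score
    is only a stand-in to demonstrate the pipeline order."""
    n_feat = len(rows[0])
    def score(j):
        pos = [r[j] for r, y in zip(rows, labels) if y == 1]
        neg = [r[j] for r, y in zip(rows, labels) if y == 0]
        return abs(sum(pos) / len(pos) - sum(neg) / len(neg))
    ranked = sorted(range(n_feat), key=score, reverse=True)
    return sorted(ranked[:k])          # indices of the kept features

def discretize_equal_width(rows, n_bins):
    """Equal-width binning applied column by column."""
    binned_cols = []
    for col in zip(*rows):
        lo, hi = min(col), max(col)
        width = (hi - lo) / n_bins or 1.0   # guard constant columns
        binned_cols.append([min(int((v - lo) / width), n_bins - 1) for v in col])
    return [list(t) for t in zip(*binned_cols)]

def preprocess(rows, labels, k, n_bins):
    """Recommended order from the thesis: feature selection FIRST,
    discretization SECOND."""
    keep = select_features(rows, labels, k)
    reduced = [[r[j] for j in keep] for r in rows]
    return discretize_equal_width(reduced, n_bins)
```

The key design point is simply that `discretize_equal_width` only ever sees the columns that survived selection, so no effort is spent computing cut points for features that are about to be discarded.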
author2 Chih-Fong Tsai
author_facet Chih-Fong Tsai
Yu-Chi CHEN
陳鈺錡
author Yu-Chi CHEN
陳鈺錡
spellingShingle Yu-Chi CHEN
陳鈺錡
Feature Selection in Data Discretization
author_sort Yu-Chi CHEN
title Feature Selection in Data Discretization
title_short Feature Selection in Data Discretization
title_full Feature Selection in Data Discretization
title_fullStr Feature Selection in Data Discretization
title_full_unstemmed Feature Selection in Data Discretization
title_sort feature selection in data discretization
publishDate 2018
url http://ndltd.ncl.edu.tw/handle/k2kf74
work_keys_str_mv AT yuchichen featureselectionindatadiscretization
AT chényùqí featureselectionindatadiscretization
AT yuchichen tèzhēngxuǎnqǔyúzīliàolísànhuàzhīyǐngxiǎng
AT chényùqí tèzhēngxuǎnqǔyúzīliàolísànhuàzhīyǐngxiǎng
_version_ 1719284373689204736