Summary: | 碩士 === 國立中央大學 === 資訊管理學系 === 106 === In the reality, data are always not “clean” as we thought. Thus, we need to figure out and ensure data quality by data pre-processing. There are many problems that we must be solved, like high dimensional data may include irrelevant and redundant features (attributes of the data). Besides, data may include many continuous attributes that would be hard to understand and explain. If people use these “unclean” data, it might decrease model prediction performance dramatically.
Previous researches show advantages derived from discretization are the reduction and the simplification of data, making the model learning faster and yielding more accurate, compact and shorter results; and noise information possibly presents in the data is reduced. It could avoid overfitting and let the data curve smoothly. In addition, feature selection is a common method for data pre-processing. By this way, it can reduce the time complexity during model training and identify important features to improve the classification accuracy of the model. Currently, there are few researches discussing the pre-processing methods by combining discretization and feature selection at the same time. Thus, this paper focuses on the optimal combination of data pre-processing by discretization and feature selection.
The experiment exploits three popular feature selection methods, which are GA(Genetic Algorithm), DT(Decision Tree Algorithm), and PCA(Principal Components Analysis). In this experiment, EWD(Equal-width discretization), EFD(Equal-frequency discretization), MDLP(Minimum Description Length Principle), and ChiMerge are used for discretization.
In order to explore the optimal combination of discretization and feature selection, the data are collected from 10 UCI Datasets. The data dimensions are from 8 to 90 and the classification problems contains 2 to 28 classes. The comparative results are based on the average accuracy by C5.0 and SVM classifiers. Our empirical results show that the MDLP discretization method gives the best predictive performance.
To conclude, implementing feature selection before discretization can make classifiers provide higher accuracy than the ones by discretization alone. Moreover, no matter which classifier is utilized (i.e. C5.0 or SVM), combining feature selection by C4.5 first and discretization by MDLP second is the most recommended combination in this thesis. The combination could make “the average classification accuracy of the model” reaches 80.1%.
|