Instance Selection and Data Discretization Influence on Classifier’s Performance

Master's === National Central University === Department of Information Management === 107 === "Data Preprocessing" plays a pivotal role in data exploration and is the first step for the analysis process of data mining. In the real world, the quality of the big data is always unclear and uneven. For example, samples in the big data often have noi...

Bibliographic Details
Main Authors:	Tzu-Ming Yen, 顏子明
Other Authors:	Chih-Fong Tsai
Format:	Others
Language:	zh-TW
Published:	2019
Online Access:	http://ndltd.ncl.edu.tw/handle/247wdz

id	ndltd-TW-107NCU05396037
record_format	oai_dc
spelling	ndltd-TW-107NCU05396037 2019-10-22T05:28:09Z http://ndltd.ncl.edu.tw/handle/247wdz Instance Selection and Data Discretization Influence on Classifier’s Performance 樣本選取與資料離散化對於分類器效果之影響 Tzu-Ming Yen 顏子明 Master's, National Central University, Department of Information Management, 107. Chih-Fong Tsai Kuen-Liang Sue 蔡志豐 蘇坤良 2019 學位論文 ; thesis 96 zh-TW
collection	NDLTD
language	zh-TW
format	Others
sources	NDLTD
description	Master's === National Central University === Department of Information Management === 107 === Data preprocessing plays a pivotal role in data exploration and is the first step of the data mining analysis process. In the real world, the quality of big data is often uncertain and uneven. For example, samples often contain noise or continuous values with low interpretability. Without proper preprocessing, these factors lead to inaccurate results. The literature has proposed instance selection, a data sampling approach that screens for representative samples. Some studies have also shown that discretization, which converts continuous values into discrete ones, can effectively improve the readability of mined rules and may also improve accuracy. To date, no study has explored whether combining instance selection and discretization achieves better performance than either preprocessing technique alone. This thesis examines the effect of combining the two techniques and how to achieve optimal performance. Three instance selection algorithms are used: Instance-Based Learning Algorithm (IB3), Genetic Algorithm (GA), and Decremental Reduction Optimization Procedure (DROP3), together with two supervised discretization algorithms: Minimum Description Length Principle (MDLP) and ChiMerge-based (ChiM). The combinations of the two types of techniques are evaluated by the performance of a k-Nearest Neighbor (KNN) classifier on 10 datasets from UCI and KEEL.
According to the experimental results, DROP3 instance selection combined with MDLP discretization is, on average, the recommended combination, and the best performance is obtained when MDLP discretization is performed after selection by DROP3, with the average accuracy rising to 85.11%.
author2	Chih-Fong Tsai
author_facet	Chih-Fong Tsai Tzu-Ming Yen 顏子明
author	Tzu-Ming Yen 顏子明
spellingShingle	Tzu-Ming Yen 顏子明 Instance Selection and Data Discretization Influence on Classifier’s Performance
author_sort	Tzu-Ming Yen
title	Instance Selection and Data Discretization Influence on Classifier’s Performance
title_short	Instance Selection and Data Discretization Influence on Classifier’s Performance
title_full	Instance Selection and Data Discretization Influence on Classifier’s Performance
title_fullStr	Instance Selection and Data Discretization Influence on Classifier’s Performance
title_full_unstemmed	Instance Selection and Data Discretization Influence on Classifier’s Performance
title_sort	instance selection and data discretization influence on classifier’s performance
publishDate	2019
url	http://ndltd.ncl.edu.tw/handle/247wdz
work_keys_str_mv	AT tzumingyen instanceselectionanddatadiscretizationinfluenceonclassifiersperformance AT yánzimíng instanceselectionanddatadiscretizationinfluenceonclassifiersperformance AT tzumingyen yàngběnxuǎnqǔyǔzīliàolísànhuàduìyúfēnlèiqìxiàoguǒzhīyǐngxiǎng AT yánzimíng yàngběnxuǎnqǔyǔzīliàolísànhuàduìyúfēnlèiqìxiàoguǒzhīyǐngxiǎng
_version_	1719273910237659136
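
The preprocessing order reported in the description (instance selection first, then supervised discretization, then KNN classification) can be illustrated with a minimal Python sketch. It is not the thesis's implementation. Wilson's Edited Nearest Neighbour filter stands in for DROP3, and a single entropy-minimizing cut per feature stands in for MDLP, which in its full form partitions recursively and applies an MDL stopping rule. All function names and the toy data are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy data: two well-separated classes plus one mislabeled point.
TRAIN = [
    ((1.0, 1.2), "a"), ((1.1, 0.9), "a"), ((0.9, 1.0), "a"), ((1.2, 1.1), "a"),
    ((3.0, 3.1), "b"), ((3.2, 2.9), "b"), ((2.9, 3.0), "b"), ((3.1, 3.2), "b"),
    ((1.05, 1.05), "b"),  # noisy label inside the "a" cluster
]


def knn_predict(train, query, k=3, skip=None):
    """Majority vote among the k nearest rows (Euclidean distance)."""
    nearest = sorted(
        (math.dist(x, query), i, y)
        for i, (x, y) in enumerate(train)
        if i != skip
    )[:k]
    return Counter(y for _, _, y in nearest).most_common(1)[0][0]


def edited_nearest_neighbour(train, k=3):
    """Stand-in for DROP3: keep rows that their k neighbours classify correctly."""
    return [
        row for i, row in enumerate(train)
        if knn_predict(train, row[0], k, skip=i) == row[1]
    ]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())


def best_cut(values, labels):
    """Simplified stand-in for MDLP: one cut minimizing weighted class entropy."""
    pairs = sorted(zip(values, labels))
    best_score, best_point = float("inf"), None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary between equal values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if score < best_score:
            best_score = score
            best_point = (pairs[i - 1][0] + pairs[i][0]) / 2
    return best_point


def entropy_cuts(train):
    """Compute one cut point per feature from the labeled rows."""
    labels = [y for _, y in train]
    return [best_cut([x[j] for x, _ in train], labels) for j in range(len(train[0][0]))]


def apply_cuts(x, cuts):
    """Map each continuous value to bin 0 or 1 using its feature's cut point."""
    return tuple(0 if c is None or v <= c else 1 for v, c in zip(x, cuts))


def run_pipeline(train, query, k=3):
    """Instance selection, then discretization, then KNN on the binned data."""
    selected = edited_nearest_neighbour(train, k)
    cuts = entropy_cuts(selected)
    binned = [(apply_cuts(x, cuts), y) for x, y in selected]
    return knn_predict(binned, apply_cuts(query, cuts), k)
```

Calling `run_pipeline(TRAIN, (1.3, 0.8))` filters the mislabeled point before the cut points are computed, so the noisy row affects neither the bins nor the final vote. Swapping the two steps inside `run_pipeline` gives the reverse ordering that the thesis compares against.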
