Instance Selection and Data Discretization Influence on Classifier’s Performance


Bibliographic Details
Main Authors: Tzu-Ming Yen, 顏子明
Other Authors: Chih-Fong Tsai
Format: Others
Language: zh-TW
Published: 2019
Online Access: http://ndltd.ncl.edu.tw/handle/247wdz
id ndltd-TW-107NCU05396037
record_format oai_dc
spelling ndltd-TW-107NCU05396037 2019-10-22T05:28:09Z http://ndltd.ncl.edu.tw/handle/247wdz Instance Selection and Data Discretization Influence on Classifier’s Performance 樣本選取與資料離散化對於分類器效果之影響 Tzu-Ming Yen 顏子明 Master's thesis, National Central University (國立中央大學), Department of Information Management (資訊管理學系), 107 Chih-Fong Tsai Kuen-Liang Sue 蔡志豐 蘇坤良 2019 thesis 96 zh-TW
collection NDLTD
language zh-TW
format Others
sources NDLTD
description Master's thesis === National Central University (國立中央大學) === Department of Information Management (資訊管理學系) === 107 === Data preprocessing plays a pivotal role in data exploration and is the first step of the data mining analysis process. In the real world, the quality of big data is often uncertain and uneven. For example, samples often contain noise, or continuous values with low interpretability. If the data are not properly preprocessed, these factors lead to inaccurate results. The literature proposes instance selection, a form of data sampling, to screen for representative samples. Some studies have also shown that discretization, which converts continuous values into discrete ones, can effectively improve the readability of the mined rules and may also improve accuracy. To date, no study has explored whether combining instance selection and discretization performs better than either preprocessing technique alone. This thesis examines the effect of combining instance selection and discretization in data preprocessing, and how to achieve the best performance. Three instance selection algorithms are used: Instance-Based Learning Algorithm (IB3), Genetic Algorithm (GA), and Decremental Reduction Optimization Procedure (DROP3); and two supervised discretization algorithms: Minimum Description Length Principle (MDLP) and ChiMerge-based (ChiM). The combinations of the two types of techniques are evaluated by the performance of a k-nearest neighbor (KNN) classifier on 10 datasets from UCI and KEEL.
The experimental results show that, on average, DROP3 instance selection combined with MDLP discretization is the recommended combination, and the best performance is obtained when MDLP discretization is performed after selection by DROP3, which raises the average accuracy to 85.11%.
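The ordering described in the abstract (instance selection first, discretization second, KNN evaluation last) can be illustrated with a minimal sketch. It does not reproduce the thesis's algorithms: Wilson-style nearest-neighbor editing stands in for DROP3/IB3/GA, and equal-width binning stands in for MDLP/ChiMerge. All function names and the toy data are hypothetical and exist only for illustration.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, x, k=3):
    """Majority vote among the k training points closest to x (Euclidean)."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def select_instances(X, y, k=3):
    """Stand-in instance selection (Wilson-style editing, NOT DROP3/IB3/GA):
    keep a sample only if its k nearest other samples vote for its label."""
    keep = []
    for i in range(len(X)):
        rest_X = X[:i] + X[i + 1:]
        rest_y = y[:i] + y[i + 1:]
        if knn_predict(rest_X, rest_y, X[i], k) == y[i]:
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]


def fit_bins(X, n_bins=3):
    """Stand-in discretizer (equal-width, NOT MDLP/ChiMerge):
    returns one (minimum, bin width) pair per feature."""
    params = []
    for col in zip(*X):
        lo, hi = min(col), max(col)
        params.append((lo, (hi - lo) / n_bins or 1.0))  # avoid zero width
    return params


def discretize(X, params, n_bins=3):
    """Map each continuous value to a bin index in [0, n_bins - 1]."""
    return [
        [min(n_bins - 1, max(0, int((v - lo) / width)))
         for v, (lo, width) in zip(row, params)]
        for row in X
    ]


def accuracy(train_X, train_y, test_X, test_y, k=3):
    """Fraction of test samples that KNN classifies correctly."""
    hits = sum(knn_predict(train_X, train_y, x, k) == t
               for x, t in zip(test_X, test_y))
    return hits / len(test_y)


def select_then_discretize(train_X, train_y, test_X, test_y, k=3, n_bins=3):
    """Selection first, then discretization fitted on the selected samples."""
    sel_X, sel_y = select_instances(train_X, train_y, k)
    params = fit_bins(sel_X, n_bins)
    return accuracy(discretize(sel_X, params, n_bins), sel_y,
                    discretize(test_X, params, n_bins), test_y, k)


# Invented toy data: two clusters plus one mislabeled (noisy) sample at the end.
train_X = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3],
           [5.0, 5.0], [5.2, 4.9], [4.8, 5.1], [5.1, 5.3],
           [1.0, 1.2]]
train_y = ["a", "a", "a", "a", "b", "b", "b", "b", "b"]
test_X = [[0.9, 1.0], [5.0, 5.2]]
test_y = ["a", "b"]

print(select_then_discretize(train_X, train_y, test_X, test_y))
```

Swapping the two steps inside `select_then_discretize` (fit the bins on the full training set, then select on the discretized values) gives the opposite ordering, which is the kind of comparison the abstract describes.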