Instance Selection and Data Discretization Influence on Classifier’s Performance
Master's === National Central University === Department of Information Management === 107 === "Data Preprocessing" plays a pivotal role in data exploration and is the first step in the data mining analysis process. In the real world, the quality of big data is often unclear and uneven. For example, samples in big data often have noi...
Main Authors: | Tzu-Ming Yen, 顏子明 |
---|---|
Other Authors: | Chih-Fong Tsai |
Format: | Others |
Language: | zh-TW |
Published: | 2019 |
Online Access: | http://ndltd.ncl.edu.tw/handle/247wdz |
id |
ndltd-TW-107NCU05396037 |
---|---|
record_format |
oai_dc |
spelling |
ndltd-TW-107NCU05396037 2019-10-22T05:28:09Z http://ndltd.ncl.edu.tw/handle/247wdz Instance Selection and Data Discretization Influence on Classifier’s Performance 樣本選取與資料離散化對於分類器效果之影響 Tzu-Ming Yen 顏子明 Master's National Central University Department of Information Management 107 Chih-Fong Tsai Kuen-Liang Sue 蔡志豐 蘇坤良 2019 thesis 96 zh-TW |
collection |
NDLTD |
language |
zh-TW |
format |
Others |
sources |
NDLTD |
description |
Master's === National Central University === Department of Information Management === 107 === "Data Preprocessing" plays a pivotal role in data exploration and is the first step in the data mining analysis process. In the real world, the quality of big data is often unclear and uneven. For example, samples often contain noise or continuous values with low interpretability. If the data are not properly preprocessed, these factors lead to inaccurate results. The literature has proposed instance selection, a form of data sampling that screens for representative samples. Some studies have also shown that discretization, which converts continuous values into discrete ones, can effectively improve the readability of mined rules and may also improve accuracy. To date, no study has explored whether combining instance selection and discretization achieves better performance than either preprocessing technique alone.
This thesis examines how combining instance selection and discretization affects data preprocessing, and how to achieve the best performance. Three instance selection algorithms are used: Instance-Based Learning Algorithm (IB3), Genetic Algorithm (GA), and Decremental Reduction Optimization Procedure (DROP3); and two supervised discretization algorithms: Minimum Description Length Principle (MDLP) and ChiMerge-based (ChiM). The combinations of the two types of techniques are evaluated by the performance of a k-Nearest Neighbor (KNN) classifier.
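As an illustration of the supervised discretization step, the following is a minimal Python sketch of bottom-up ChiMerge-style merging for a single continuous feature. It is not the thesis's implementation: the function names (`chi2_adjacent`, `chimerge`, `discretize`), the toy data, and the threshold of 2.706 (the chi-square critical value at the 0.10 level with one degree of freedom, which suits a two-class problem) are assumptions chosen for the example.

```python
from bisect import bisect_right
from collections import Counter


def chi2_adjacent(a, b, classes):
    """Chi-square statistic for the class counts of two adjacent intervals."""
    total = sum(a.values()) + sum(b.values())
    stat = 0.0
    for row in (a, b):
        row_total = sum(row.values())
        for c in classes:
            expected = row_total * (a[c] + b[c]) / total
            if expected > 0:  # skip classes absent from both intervals
                stat += (row[c] - expected) ** 2 / expected
    return stat


def chimerge(values, labels, threshold=2.706):
    """Return the sorted lower bounds of the intervals left after merging.

    Assumption: threshold is a chi-square critical value picked by the caller.
    """
    classes = sorted(set(labels))
    counts = {}
    for v, y in zip(values, labels):
        counts.setdefault(v, Counter())[y] += 1
    bounds = sorted(counts)              # start with one interval per distinct value
    rows = [counts[v] for v in bounds]
    while len(rows) > 1:
        stats = [chi2_adjacent(rows[i], rows[i + 1], classes)
                 for i in range(len(rows) - 1)]
        i = min(range(len(stats)), key=stats.__getitem__)
        if stats[i] > threshold:         # every adjacent pair differs enough: stop
            break
        rows[i] = rows[i] + rows[i + 1]  # merge the most similar adjacent pair
        del rows[i + 1], bounds[i + 1]
    return bounds


def discretize(value, bounds):
    """Map a continuous value to the index of the interval that contains it."""
    return max(0, bisect_right(bounds, value) - 1)


# Toy data (hypothetical): two well-separated classes on one feature.
values = [1, 2, 3, 4, 10, 11, 12, 13]
labels = ["a", "a", "a", "a", "b", "b", "b", "b"]
cuts = chimerge(values, labels)
codes = [discretize(v, cuts) for v in values]
print(cuts, codes)
```

Each pass merges the adjacent pair of intervals whose class distributions are most alike, so the surviving cut points sit where the class mix changes. MDLP differs in that it splits top-down using an entropy criterion rather than merging bottom-up.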
This study uses 10 datasets from UCI and KEEL to explore instance selection and discretization. The experimental results show that, on average, the DROP3 instance selection algorithm combined with the MDLP discretization algorithm is the most recommended combination, and that the best performance is obtained when MDLP discretization is performed after selection by DROP3, which raises the average accuracy to 85.11%.
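The ordering described above (select instances first, then discretize, then classify with KNN) can be sketched in Python as follows. This only illustrates the ordering, under loud assumptions: Wilson's edited nearest neighbour rule stands in for DROP3 (DROP3 begins with a similar noise filter but does considerably more), equal-width binning stands in for MDLP, and all function names and the toy data are hypothetical.

```python
from collections import Counter


def knn_predict(train, x, k=3):
    """Majority vote among the k nearest training points (squared Euclidean)."""
    nearest = sorted(
        train, key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], x))
    )[:k]
    return Counter(y for _, y in nearest).most_common(1)[0][0]


def enn_select(train, k=3):
    """Stand-in for DROP3: drop points misclassified by their k neighbours."""
    kept = []
    for i, (x, y) in enumerate(train):
        others = train[:i] + train[i + 1:]
        if knn_predict(others, x, k) == y:
            kept.append((x, y))
    return kept


def fit_equal_width(train, bins=4):
    """Stand-in for MDLP: learn per-feature equal-width bins from `train`."""
    cols = list(zip(*(x for x, _ in train)))
    lows = [min(c) for c in cols]
    widths = [(max(c) - min(c)) / bins or 1.0 for c in cols]  # avoid zero width

    def transform(x):
        return tuple(
            min(bins - 1, max(0, int((v - lo) / w)))
            for v, lo, w in zip(x, lows, widths)
        )

    return transform


def select_then_discretize(train, test, k=3):
    """Selection first, so the bins are learned from the reduced set only."""
    selected = enn_select(train, k)
    transform = fit_equal_width(selected)
    reduced = [(transform(x), y) for x, y in selected]
    hits = sum(knn_predict(reduced, transform(x), k) == y for x, y in test)
    return len(selected), hits / len(test)


# Toy data (hypothetical): one noisy "b" sample sits inside the "a" cluster.
train = [((0.0,), "a"), ((1.0,), "a"), ((2.0,), "a"), ((3.0,), "a"),
         ((1.5,), "b"),
         ((10.0,), "b"), ((11.0,), "b"), ((12.0,), "b"), ((13.0,), "b")]
test = [((0.5,), "a"), ((2.5,), "a"), ((11.5,), "b")]
kept, accuracy = select_then_discretize(train, test)
print(kept, accuracy)
```

Running selection first means the discretizer never sees the instances that were filtered out, so noisy samples cannot shift the cut points. Swapping the two steps would instead learn the bins from the full training set and select instances in the discretized space.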
|
author2 |
Chih-Fong Tsai |
author_facet |
Chih-Fong Tsai Tzu-Ming Yen 顏子明 |
author |
Tzu-Ming Yen 顏子明 |
spellingShingle |
Tzu-Ming Yen 顏子明 Instance Selection and Data Discretization Influence on Classifier’s Performance |
author_sort |
Tzu-Ming Yen |
title |
Instance Selection and Data Discretization Influence on Classifier’s Performance |
title_short |
Instance Selection and Data Discretization Influence on Classifier’s Performance |
title_full |
Instance Selection and Data Discretization Influence on Classifier’s Performance |
title_fullStr |
Instance Selection and Data Discretization Influence on Classifier’s Performance |
title_full_unstemmed |
Instance Selection and Data Discretization Influence on Classifier’s Performance |
title_sort |
instance selection and data discretization influence on classifier’s performance |
publishDate |
2019 |
url |
http://ndltd.ncl.edu.tw/handle/247wdz |
work_keys_str_mv |
AT tzumingyen instanceselectionanddatadiscretizationinfluenceonclassifiersperformance AT yánzimíng instanceselectionanddatadiscretizationinfluenceonclassifiersperformance AT tzumingyen yàngběnxuǎnqǔyǔzīliàolísànhuàduìyúfēnlèiqìxiàoguǒzhīyǐngxiǎng AT yánzimíng yàngběnxuǎnqǔyǔzīliàolísànhuàduìyúfēnlèiqìxiàoguǒzhīyǐngxiǎng |
_version_ |
1719273910237659136 |