A Study of Data Pre-process: the Integration of Imputation and Instance Selection
Master's === National Central University === Department of Information Management === 102 === In practice, collected data usually contain missing values and noise, which are likely to degrade data mining performance. As a result, a data pre-processing step is necessary before data mining. The aim of data pre-processing is to deal with missing...
Main Authors: | Fu-yu Chang, 張復喻 |
---|---|
Other Authors: | Chih-fong Tsai |
Format: | Others |
Language: | zh-TW |
Published: | 2014 |
Online Access: | http://ndltd.ncl.edu.tw/handle/4kqb4j |
id |
ndltd-TW-102NCU05396029 |
---|---|
record_format |
oai_dc |
spelling |
ndltd-TW-102NCU053960292019-05-15T21:32:34Z http://ndltd.ncl.edu.tw/handle/4kqb4j A Study of Data Pre-process: the Integration of Imputation and Instance Selection 資料前處理:整合補值法與樣本選取之研究 Fu-yu Chang 張復喻 碩士 國立中央大學 資訊管理學系 102 Chih-fong Tsai 蔡志豐 2014 學位論文 ; thesis 85 zh-TW |
collection |
NDLTD |
language |
zh-TW |
format |
Others |
sources |
NDLTD |
description |
Master's === National Central University === Department of Information Management === 102 === In practice, collected data usually contain missing values and noise, which are likely to degrade data mining performance. As a result, a data pre-processing step is necessary before data mining. The aim of data pre-processing is to deal with missing values and filter out noisy data. In particular, "imputation" and "instance selection" are two common solutions for data pre-processing.
The aim of imputation is to provide estimates for missing values by reasoning from the observed data (i.e., complete data). Although various missing value imputation algorithms have been proposed in the literature, the values produced by most imputation algorithms rely heavily on the complete (training) data. Therefore, if some of the complete data contain noise, the noise directly affects the quality of the imputation and of the data mining results. In this thesis, four integration processes are proposed. In one process, instance selection is executed first to remove noisy (complete) data from the training set, and imputation is then performed based on the reduced training set (Process 2). Conversely, in another process, imputation is employed first to produce a complete training set, and instance selection is then performed to filter out noisy data from this set (Process 1). In order to obtain more representative data, instance selection is performed again over the outputs produced by Processes 1 and 2 (Process 3 and Process 4).
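The ordering of the four processes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the abstract does not name the imputation or instance selection algorithms, so mean imputation and an Edited Nearest Neighbour style filter stand in for them here, the data are assumed to be numerical, and the mapping of Process 3 to the output of Process 1 and Process 4 to the output of Process 2 is assumed. All function names are hypothetical.

```python
from collections import Counter
from math import dist


def mean_impute(rows, reference=None):
    """Fill None cells with column means taken from `reference` (default: `rows`).

    Stand-in imputer (assumption): the record does not name the algorithm used.
    Each row is a (features, label) pair.
    """
    if not rows:
        return []
    ref = rows if reference is None else reference
    means = []
    for j in range(len(rows[0][0])):
        observed = [x[j] for x, _ in ref if x[j] is not None]
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [([means[j] if v is None else v for j, v in enumerate(x)], y)
            for x, y in rows]


def enn_select(rows, k=3):
    """Keep a row only if the majority label of its k nearest neighbours matches its own.

    Stand-in instance selection (assumption): an Edited Nearest Neighbour style filter.
    Requires rows without missing values.
    """
    kept = []
    for i, (x, y) in enumerate(rows):
        others = [r for j, r in enumerate(rows) if j != i]
        neighbours = sorted(others, key=lambda r: dist(x, r[0]))[:k]
        majority = Counter(label for _, label in neighbours).most_common(1)[0][0]
        if majority == y:
            kept.append((x, y))
    return kept


def is_complete(row):
    return all(v is not None for v in row[0])


def process_1(rows):
    # Imputation first, then instance selection on the completed training set.
    return enn_select(mean_impute(rows))


def process_2(rows):
    # Instance selection on the complete rows first, then imputation of the
    # incomplete rows based on the reduced set.
    complete = [r for r in rows if is_complete(r)]
    incomplete = [r for r in rows if not is_complete(r)]
    reduced = enn_select(complete)
    return reduced + mean_impute(incomplete, reference=reduced)


def process_3(rows):
    # Assumed mapping: a second selection pass over the output of Process 1.
    return enn_select(process_1(rows))


def process_4(rows):
    # Assumed mapping: a second selection pass over the output of Process 2.
    return enn_select(process_2(rows))
```

In this sketch the difference between Process 1 and Process 2 is visible in which rows feed the column means: Process 1 computes them from every observed value, including noisy rows, while Process 2 computes them only from rows that survived the filter.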
The experiments are based on 31 different data sets, which contain categorical, numerical, and mixed types of data, with missing rates from 10% to 50% at 10% intervals per data set. A decision tree model is then constructed to extract useful rules that recommend which integration process to use under which conditions (no. of samples, no. of attributes, no. of classes, type of data set, missing rate).
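The missing-rate settings described above can be illustrated with a short sketch that masks a fixed share of feature cells. The abstract does not say how the missing values were generated, so uniformly random masking is an assumption here, and `inject_missing` and `MISSING_RATES` are hypothetical names.

```python
import random

# Missing rates from 10% to 50% at 10% intervals, as stated in the abstract.
MISSING_RATES = [0.1, 0.2, 0.3, 0.4, 0.5]


def inject_missing(rows, rate, seed=0):
    """Return a copy of `rows` with a share `rate` of feature cells set to None.

    Assumption: cells are chosen uniformly at random; the record does not
    state the missingness mechanism. Each row is a (features, label) pair.
    """
    rng = random.Random(seed)
    cells = [(i, j) for i, (x, _) in enumerate(rows) for j in range(len(x))]
    masked = set(rng.sample(cells, round(rate * len(cells))))
    return [([None if (i, j) in masked else v for j, v in enumerate(x)], y)
            for i, (x, y) in enumerate(rows)]
```

Calling `inject_missing(rows, rate)` once per entry in `MISSING_RATES` yields five incomplete variants of a data set; the input rows are left unchanged.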
|
author2 |
Chih-fong Tsai |
author_facet |
Chih-fong Tsai Fu-yu Chang 張復喻 |
author |
Fu-yu Chang 張復喻 |
title |
A Study of Data Pre-process: the Integration of Imputation and Instance Selection |
publishDate |
2014 |
url |
http://ndltd.ncl.edu.tw/handle/4kqb4j |