A Study of Data Pre-process: the Integration of Imputation and Instance Selection

Master's thesis === National Central University (國立中央大學) === Department of Information Management (資訊管理學系) === 102 === In practice, collected data usually contain missing values and noise, which are likely to degrade data mining performance. As a result, a data pre-processing step is necessary before data mining. The aim of data pre-processing is to deal with missing...

Bibliographic Details
Main Authors: Fu-yu Chang, 張復喻
Other Authors: Chih-fong Tsai
Format: Others
Language: zh-TW
Published: 2014
Online Access: http://ndltd.ncl.edu.tw/handle/4kqb4j
id ndltd-TW-102NCU05396029
record_format oai_dc
spelling ndltd-TW-102NCU05396029 2019-05-15T21:32:34Z http://ndltd.ncl.edu.tw/handle/4kqb4j A Study of Data Pre-process: the Integration of Imputation and Instance Selection 資料前處理:整合補值法與樣本選取之研究 Fu-yu Chang 張復喻 Master's thesis, National Central University (國立中央大學), Department of Information Management (資訊管理學系), 102 Chih-fong Tsai 蔡志豐 2014 學位論文 ; thesis 85 zh-TW
collection NDLTD
language zh-TW
format Others
sources NDLTD
description Master's thesis === National Central University (國立中央大學) === Department of Information Management (資訊管理學系) === 102 === In practice, collected data usually contain missing values and noise, which are likely to degrade data mining performance. A data pre-processing step is therefore necessary before data mining; its aim is to handle missing values and filter out noisy data. “Imputation” and “instance selection” are two common solutions for this purpose. Imputation estimates missing values by reasoning from the observed (i.e., complete) data. Although various missing value imputation algorithms have been proposed in the literature, the values that most of them produce rely heavily on the complete (training) data. Therefore, if some of the complete data contain noise, the quality of both the imputation and the data mining results is directly affected. This thesis proposes four integration processes. In Process 1, imputation is performed first to produce a complete training set, and instance selection is then performed to filter out noisy data from this set. In Process 2, instance selection is executed first to remove noisy (complete) data from the training set, and imputation is then performed based on the reduced training set. In order to retain more representative data, instance selection is performed again over the outputs produced by Processes 1 and 2 (Process 3 and Process 4). The experiments are based on 31 data sets containing categorical, numerical, and mixed types of data, with missing rates per data set ranging from 10% to 50% in 10% intervals. A decision tree model is then constructed to extract rules that recommend which integration process to use under which conditions (number of samples, number of attributes, number of classes, type of data set, and missing rate).
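
The four integration processes described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: the record does not name the imputation or instance selection algorithms used, so column-mean imputation and Wilson's edited nearest neighbour rule (ENN) stand in for them here. The function names, the toy data, and the mapping of Process 3 to the output of Process 1 and of Process 4 to the output of Process 2 are likewise assumptions.

```python
from collections import Counter
from math import dist
from statistics import mean

# Each row is (features, label); None marks a missing value (assumed encoding).

def impute_mean(rows, reference=None):
    """Fill missing values with column means taken from `reference` rows."""
    reference = rows if reference is None else reference
    n_features = len(reference[0][0])
    means = [mean(x[j] for x, _ in reference if x[j] is not None)
             for j in range(n_features)]
    return [([means[j] if v is None else v for j, v in enumerate(x)], y)
            for x, y in rows]

def select_enn(rows, k=3):
    """Keep a row only if its label matches the majority of its k nearest neighbours."""
    kept = []
    for i, (x, y) in enumerate(rows):
        others = [r for j, r in enumerate(rows) if j != i]
        neighbours = sorted(others, key=lambda r: dist(x, r[0]))[:k]
        majority = Counter(lbl for _, lbl in neighbours).most_common(1)[0][0]
        if majority == y:
            kept.append((x, y))
    return kept

def process_1(rows, k=3):
    """Imputation first, then instance selection on the completed set."""
    return select_enn(impute_mean(rows), k)

def process_2(rows, k=3):
    """Instance selection on the complete rows first, then imputation from the reduced set."""
    complete = [r for r in rows if None not in r[0]]
    incomplete = [r for r in rows if None in r[0]]
    reduced = select_enn(complete, k)
    return reduced + impute_mean(incomplete, reference=reduced)

def process_3(rows, k=3):
    """Process 1 followed by a second instance selection pass (assumed mapping)."""
    return select_enn(process_1(rows, k), k)

def process_4(rows, k=3):
    """Process 2 followed by a second instance selection pass (assumed mapping)."""
    return select_enn(process_2(rows, k), k)

# Toy data: two clusters, one deliberately mislabelled row, two rows with missing values.
rows = [
    ([1.0, 1.0], "a"), ([1.2, 0.9], "a"), ([0.9, 1.1], "a"),
    ([1.1, 1.0], "b"),  # noisy label: sits inside the "a" cluster
    ([5.0, 5.0], "b"), ([5.1, 4.9], "b"), ([4.9, 5.2], "b"),
    ([5.0, None], "b"), ([None, 1.0], "a"),
]

for name, process in [("Process 1", process_1), ("Process 2", process_2),
                      ("Process 3", process_3), ("Process 4", process_4)]:
    print(name, len(process(rows)), "rows kept")
```

On this toy set, both Process 1 and Process 2 drop the mislabelled row. The difference is that in Process 2 the noisy row no longer contributes to the column means used for imputation, which is the effect the abstract points to when it notes that noise in the complete data affects imputation quality.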
author2 Chih-fong Tsai
author_facet Chih-fong Tsai
Fu-yu Chang
張復喻
author Fu-yu Chang
張復喻
spellingShingle Fu-yu Chang
張復喻
A Study of Data Pre-process: the Integration of Imputation and Instance Selection
author_sort Fu-yu Chang
title A Study of Data Pre-process: the Integration of Imputation and Instance Selection
publishDate 2014
url http://ndltd.ncl.edu.tw/handle/4kqb4j