A Study of Data Pre-process: the Integration of Imputation and Instance Selection

Master's === National Central University === Department of Information Management === 102 === In practice, collected data usually contain missing values and noise, which are likely to degrade data mining performance. As a result, a data pre-processing step is necessary before data mining. The aim of data pre-processing is to deal with missing...


Bibliographic Details
Main Authors:	Fu-yu Chang, 張復喻
Other Authors:	Chih-fong Tsai
Format:	Others
Language:	zh-TW
Published:	2014
Online Access:	http://ndltd.ncl.edu.tw/handle/4kqb4j

id	ndltd-TW-102NCU05396029
record_format	oai_dc
spelling	ndltd-TW-102NCU05396029 2019-05-15T21:32:34Z http://ndltd.ncl.edu.tw/handle/4kqb4j A Study of Data Pre-process: the Integration of Imputation and Instance Selection 資料前處理：整合補值法與樣本選取之研究 Fu-yu Chang 張復喻 Master's, National Central University, Department of Information Management, 102 Chih-fong Tsai 蔡志豐 2014 學位論文 ; thesis 85 zh-TW
collection	NDLTD
language	zh-TW
format	Others
sources	NDLTD
description	Master's === National Central University === Department of Information Management === 102 === In practice, collected data usually contain missing values and noise, which are likely to degrade data mining performance. As a result, a data pre-processing step is necessary before data mining. The aim of data pre-processing is to deal with missing values and filter out noisy data. In particular, “imputation” and “instance selection” are two common solutions for data pre-processing. The aim of imputation is to provide estimates for missing values by reasoning from the observed data (i.e., complete data). Although various missing value imputation algorithms have been proposed in the literature, the outputs produced by most imputation algorithms rely heavily on the complete (training) data. Therefore, if some of the complete data contain noise, the noise will directly affect the quality of the imputation and of the data mining results. In this thesis, four integration processes are proposed. In Process 1, imputation is performed first to produce a complete training set, and instance selection is then performed to filter out noisy data from this set. In Process 2, conversely, instance selection is executed first to remove noisy (complete) data from the training set, and imputation is then performed based on the reduced training set. In order to filter out more representative data, instance selection is performed again over the outputs produced by Processes 1 and 2 (Process 3 and Process 4). The experiments are based on 31 different data sets, which contain categorical, numerical, and mixed types of data, with missing rates from 10% to 50% at 10% intervals per data set. A decision tree model is then constructed to extract rules that recommend which integration process to use under which conditions (number of samples, number of attributes, number of classes, type of data set, missing rate).
author2	Chih-fong Tsai
author_facet	Chih-fong Tsai Fu-yu Chang 張復喻
author	Fu-yu Chang 張復喻
spellingShingle	Fu-yu Chang 張復喻 A Study of Data Pre-process: the Integration of Imputation and Instance Selection
author_sort	Fu-yu Chang
title	A Study of Data Pre-process: the Integration of Imputation and Instance Selection
title_short	A Study of Data Pre-process: the Integration of Imputation and Instance Selection
title_full	A Study of Data Pre-process: the Integration of Imputation and Instance Selection
title_fullStr	A Study of Data Pre-process: the Integration of Imputation and Instance Selection
title_full_unstemmed	A Study of Data Pre-process: the Integration of Imputation and Instance Selection
title_sort	study of data pre-process: the integration of imputation and instance selection
publishDate	2014
url	http://ndltd.ncl.edu.tw/handle/4kqb4j
work_keys_str_mv	AT fuyuchang astudyofdatapreprocesstheintegrationofimputationandinstanceselection AT zhāngfùyù astudyofdatapreprocesstheintegrationofimputationandinstanceselection AT fuyuchang zīliàoqiánchùlǐzhěnghébǔzhífǎyǔyàngběnxuǎnqǔzhīyánjiū AT zhāngfùyù zīliàoqiánchùlǐzhěnghébǔzhífǎyǔyàngběnxuǎnqǔzhīyánjiū AT fuyuchang studyofdatapreprocesstheintegrationofimputationandinstanceselection AT zhāngfùyù studyofdatapreprocesstheintegrationofimputationandinstanceselection
_version_	1719115435885985792
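
The four integration processes named in the description field can be illustrated with a minimal Python sketch. The record does not say which imputation or instance selection algorithms the thesis uses, so the sketch substitutes two simple stand-ins that are purely assumptions: column-mean imputation (`impute_mean`) and Edited Nearest Neighbour filtering (`select_enn`). It also assumes that Process 3 repeats instance selection on the output of Process 1 and Process 4 on the output of Process 2. All function names are hypothetical, and only numeric attributes are handled.

```python
import math
from collections import Counter

def impute_mean(rows, reference=None):
    """Fill None entries with column means taken from `reference` (assumed stand-in imputer)."""
    if not rows:
        return []
    reference = rows if reference is None else reference
    means = []
    for j in range(len(rows[0][0])):
        observed = [x[j] for x, _ in reference if x[j] is not None]
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [([means[j] if v is None else v for j, v in enumerate(x)], y)
            for x, y in rows]

def select_enn(rows, k=3):
    """Edited Nearest Neighbour (assumed stand-in selector): drop a row when the
    majority label of its k nearest neighbours differs from its own label."""
    kept = []
    for i, (x, y) in enumerate(rows):
        others = [r for j, r in enumerate(rows) if j != i]
        nearest = sorted(others, key=lambda r: math.dist(x, r[0]))[:k]
        if not nearest:
            kept.append((x, y))  # nothing to compare against, keep the row
            continue
        majority = Counter(label for _, label in nearest).most_common(1)[0][0]
        if majority == y:
            kept.append((x, y))
    return kept

def is_complete(row):
    return all(v is not None for v in row[0])

def process_1(rows, k=3):
    # Imputation first, then instance selection on the completed training set.
    return select_enn(impute_mean(rows), k)

def process_2(rows, k=3):
    # Instance selection on the complete rows first, then imputation of the
    # incomplete rows based on the reduced (cleaned) set.
    complete = [r for r in rows if is_complete(r)]
    incomplete = [r for r in rows if not is_complete(r)]
    clean = select_enn(complete, k)
    return clean + impute_mean(incomplete, reference=clean)

def process_3(rows, k=3):
    # Assumption: a second selection pass over the output of Process 1.
    return select_enn(process_1(rows, k), k)

def process_4(rows, k=3):
    # Assumption: a second selection pass over the output of Process 2.
    return select_enn(process_2(rows, k), k)
```

Each row is a pair `(features, label)` with `None` marking a missing value. The only difference between Process 1 and Process 2 in this sketch is whether noisy rows can influence the imputed values: in `process_2` the column means are computed after the noisy rows have been removed.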

Similar Items