A Study of Data Pre-process: the Integration of Imputation and Instance Selection

Bibliographic Details
Main Authors: Fu-yu Chang, 張復喻
Other Authors: Chih-fong Tsai
Format: Others
Language: zh-TW
Published: 2014
Online Access: http://ndltd.ncl.edu.tw/handle/4kqb4j
Description
Summary: Master's thesis === National Central University === Department of Information Management === 102 === In practice, collected data usually contain missing values and noise, which are likely to degrade data mining performance. A data pre-processing step is therefore necessary before data mining. The aim of data pre-processing is to handle missing values and filter out noisy data. In particular, "imputation" and "instance selection" are two common solutions for this purpose. The aim of imputation is to estimate missing values by reasoning from the observed (i.e., complete) data. Although various missing value imputation algorithms have been proposed in the literature, the estimates produced by most of them rely heavily on the complete (training) data. Therefore, if some of the complete data contain noise, the quality of the imputation and of the data mining results is directly affected. In this thesis, four integration processes are proposed. In one process, instance selection is executed first to remove noisy (complete) instances from the training set, and imputation is then performed on the reduced training set (Process 2). Conversely, imputation is employed first to produce a complete training set, after which instance selection filters out noisy instances from that set (Process 1). In order to select more representative data, instance selection is performed again over the outputs produced by Processes 1 and 2 (Process 3 & Process 4). The experiments are based on 31 data sets, which contain categorical, numerical, and mixed types of data, with missing rates from 10% to 50% in 10% intervals per dataset. A decision tree model is then constructed to extract useful rules recommending which integration process to use for given dataset characteristics (no. of samples, no. of attributes, no. of classes, type of dataset, missing rate).
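The abstract does not name the specific imputation or instance selection algorithms used in the thesis. Purely as an illustration of the "instance selection first, then imputation" idea (Process 2), the sketch below combines a simple ENN-style noise filter (remove instances whose label disagrees with the majority of their k nearest neighbours) with column-mean imputation; both choices are assumptions, not the thesis' actual methods.

```python
import numpy as np

def edited_nn_selection(X, y, k=3):
    """Simple ENN-style noise filter: keep an instance only if its label
    agrees with the majority label of its k nearest neighbours."""
    keep = []
    for i in range(len(X)):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                      # exclude the instance itself
        nn = np.argsort(d)[:k]             # indices of k nearest neighbours
        votes = np.bincount(y[nn], minlength=y.max() + 1)
        if votes.argmax() == y[i]:
            keep.append(i)
    return np.array(keep)

def mean_impute(X_missing, X_complete):
    """Fill NaNs column-wise with means estimated from the complete data."""
    means = X_complete.mean(axis=0)
    X = X_missing.copy()
    for j in range(X.shape[1]):
        X[np.isnan(X[:, j]), j] = means[j]
    return X

# Process 2 (sketched): filter the complete training data first, then
# impute the incomplete instances from the reduced (cleaner) set.
# kept = edited_nn_selection(X_complete, y_complete, k=3)
# X_filled = mean_impute(X_incomplete, X_complete[kept])
```

Running instance selection first, as in Process 2, means the noisy instances never contribute to the imputation estimates, which is exactly the dependency the abstract highlights: imputation quality rests on the cleanliness of the complete training data.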