THE SAMPLING METHODS FOR VALUE-ADDED DATABASE IN DATA-MINING

Bibliographic Details
Main Author: 陳惠雯
Other Authors: Ting, Chao-Ping
Format: Others
Language: en_US
Published: 2004
Online Access: http://ndltd.ncl.edu.tw/handle/49638644905153949207
Description
Summary: Master's thesis === National Chengchi University === Graduate Institute of Statistics === 92 === As databases continue to grow, a trend in today's business environment that will persist for the foreseeable future, reviewing all of the data held on corporate or organizational networks (sales figures, manufacturing statistics, financial data, experimental data) to extract quality information is costly, time-consuming, and ineffective. We therefore need a sound and effective method for obtaining only a portion of the data that is representative of the population and that allows us to build a reliable model from the sampled data. Sometimes, however, the database is limited in size. In that case we adopt the relatively new idea of adding attributes or values to the database to enhance the quality of the data. In such a procedure, a good sampling method is essential groundwork for reaching the final goal of a reliable predictive model. Our research goal is thus to obtain an effective and representative value-added sample, by means of a sampling method, for building an accurate predictive model. The concept is straightforward: to get samples that predict well, we need the right sampling method. The sampling methods under study are simple random sampling, systematic sampling, stratified sampling, and uniform design. The models used are C5.0, logistic regression, and a neural network for a categorical target variable, and stepwise regression for a continuous target variable. The results are discussed in the conclusion section. Keywords: Database, Data Mining, Sampling, Value-added database
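
Three of the sampling schemes named in the summary (simple random, systematic, and stratified sampling) can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the function names, the `key` parameter, the fixed seeds, and the proportional allocation used for the strata are all assumptions made for illustration. Uniform design is omitted because the summary does not describe how it was constructed.

```python
import random
from collections import defaultdict


def simple_random_sample(rows, n, seed=0):
    """Draw n rows without replacement; every subset of size n is equally likely."""
    return random.Random(seed).sample(rows, n)


def systematic_sample(rows, n, seed=0):
    """Pick a random start in the first interval, then take every k-th row."""
    k = len(rows) // n  # sampling interval (hypothetical choice: floor division)
    if k < 1:
        raise ValueError("sample size exceeds number of rows")
    start = random.Random(seed).randrange(k)
    return [rows[start + i * k] for i in range(n)]


def stratified_sample(rows, n, key, seed=0):
    """Sample each stratum in proportion to its share of the rows.

    Assumption: proportional allocation with rounding, so the total
    can differ slightly from n when the shares are not exact.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)
    sample = []
    for members in strata.values():
        size = round(n * len(members) / len(rows))
        sample.extend(rng.sample(members, min(size, len(members))))
    return sample


if __name__ == "__main__":
    # Hypothetical toy database: 100 records in two segments (70 / 30).
    rows = [{"id": i, "segment": "A" if i < 70 else "B"} for i in range(100)]
    print(len(simple_random_sample(rows, 10)))
    print([r["id"] for r in systematic_sample(rows, 10)])
    print(len(stratified_sample(rows, 10, key=lambda r: r["segment"])))
```

Each function returns a list of rows that could then be passed to whichever model is being compared. Seeding a private `random.Random` instance keeps the draws reproducible without touching the global random state.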