Summary: | Master's === National Cheng Kung University === Institute of Information Management === 105 === K-fold cross validation is an accuracy estimation method used in many types of experimental research. Stratification, however, is seldom applied to make the data in each partition more representative. Stratification has the advantage of reducing the variance of the estimator and thus estimating the true accuracy better. Research in this area looks at stratification and imbalanced datasets from different perspectives. General datasets are used to develop new algorithms that build on standard stratification in K-fold cross validation, or to investigate the bias and variance of the estimator. Imbalanced datasets are used to discuss the performance of stratification in terms of recall, precision, or other measures when a class value is rare. Many studies recommend their own algorithm without using an appropriate parametric method for statistical comparison. The purpose of this study is therefore to compare these stratification methods under the same conditions, using the decision tree and k-nearest neighbors algorithms, through sound statistical comparison. The results demonstrate that the estimates obtained by K-fold cross validation are close whether or not stratification is applied, for single or multiple general or imbalanced datasets. Furthermore, when time complexity is a consideration and a stable estimator is desired, standard stratification can be used with K-fold cross validation. With advanced stratification, which takes into account the features of the data instances relative to one another, the estimator is relatively more stable than with standard stratification.
|
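As an illustration of what standard stratification means in K-fold cross validation, the following is a minimal sketch, not the thesis's own implementation. The function name `stratified_kfold_indices`, the toy label list, and the round-robin assignment are assumptions made for the example; it only shows how each fold can be given roughly the same class proportions as the full dataset, including for a rare class.

```python
import random
from collections import Counter, defaultdict


def stratified_kfold_indices(labels, k, seed=0):
    """Assign sample indices to k folds so class proportions stay roughly equal.

    Hypothetical helper for illustration: samples are grouped by class,
    shuffled within each class, and dealt to the folds in round-robin order.
    """
    rng = random.Random(seed)

    # Group sample indices by their class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    offset = 0  # continue dealing where the previous class stopped
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        for j, idx in enumerate(indices):
            folds[(offset + j) % k].append(idx)
        offset += len(indices)
    return folds


# Toy imbalanced dataset (assumed for the example): 90 majority, 10 rare.
labels = [0] * 90 + [1] * 10
folds = stratified_kfold_indices(labels, k=5)

# Count the class labels that ended up in each fold.
fold_class_counts = [Counter(labels[i] for i in fold) for fold in folds]
for number, counts in enumerate(fold_class_counts, start=1):
    print(f"fold {number}: {dict(counts)}")
```

In this toy setting every fold receives 18 majority samples and 2 rare samples, so each test partition preserves the 9:1 class ratio. Plain K-fold without stratification gives no such guarantee, and a fold may contain few or no rare-class samples.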