Supervised Learning Approaches and Feature Selection - A Case Study in Diabetes

Bibliographic Details
Main Authors: Yugowati Praharsi, 游華英
Other Authors: Shaou-Gang Miaou
Format: Others
Language: en_US
Published: 2009
Online Access: http://ndltd.ncl.edu.tw/handle/17100265072485850834
id ndltd-TW-097CYCU5428062
record_format oai_dc
spelling ndltd-TW-097CYCU54280622015-10-13T12:04:54Z http://ndltd.ncl.edu.tw/handle/17100265072485850834
Supervised Learning Approaches and Feature Selection - A Case Study in Diabetes
用於偵測糖尿病的監督式學習法和特徵選取
Yugowati Praharsi 游華英
Master's thesis === Chung Yuan Christian University === Graduate Institute of Electronic Engineering === academic year 97 ===
Abstract: Data description and classification are important tasks with wide application in supervised learning. This thesis considers three supervised learning methods: k-Nearest Neighbor (k-NN), Support Vector Data Description (SVDD), and Support Vector Machine (SVM). Feature selection in supervised learning seeks a feature subset that yields higher classification accuracy; both a forward-selection-based wrapper approach and a correlation-based filter approach are considered. Correlation between features and the class label is measured using entropy and information gain (IG), while feature-feature correlation is computed with the Pearson correlation coefficient. This study compares the performance of the three classifiers (k-NN, SVDD, and SVM) with and without feature selection. The expectation is that classifiers using the proposed feature selection methods will outperform classifiers without feature selection, and that the selected feature subset describes the data structure regardless of the classifier type or feature selection method used. The data sample is the Pima Indians Diabetes dataset from the UCI Machine Learning Repository.
The results show that forward feature selection produces the best feature subset for SVM and 5-NN. Feature selection based on mean information gain and a standard deviation threshold gives the best result for the 1-NN classifier and can serve as a substitute for forward selection: it is computationally efficient, and for SVM and 5-NN its accuracy does not decrease significantly compared with forward selection. Finally, among the eight candidate features, glucose level is the most prominent feature for diabetes detection across all classifiers and feature selection methods considered. The relevancy measured by IG can be used to sort features from most to least important, which is useful in medical applications such as prioritizing features for symptom recognition.
Advisor: Shaou-Gang Miaou 繆紹綱. Published 2009. Degree thesis; 62 pages; en_US
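The abstract's filter approach ranks features by information gain against the class label and then keeps those passing a mean-and-standard-deviation threshold. A minimal sketch of that idea on synthetic data follows; the bin count, the exact threshold rule, and the toy features are assumptions for illustration, not taken from the thesis:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a discrete label array, in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(feature, labels, n_bins=4):
    """IG of a continuous feature w.r.t. the class label,
    after equal-width discretization into n_bins intervals."""
    edges = np.histogram_bin_edges(feature, bins=n_bins)
    bins = np.digitize(feature, edges[1:-1])   # bin index 0..n_bins-1
    h_y = entropy(labels)
    h_y_given_x = 0.0
    for b in np.unique(bins):
        mask = bins == b
        h_y_given_x += mask.mean() * entropy(labels[mask])
    return h_y - h_y_given_x                   # IG = H(Y) - H(Y|X) >= 0

# Toy data standing in for the eight Pima features:
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 200)                    # binary diabetes label
X = rng.normal(size=(200, 3))
X[:, 0] += 2.0 * y                             # feature 0 informative, like glucose
gains = [information_gain(X[:, j], y) for j in range(X.shape[1])]
ranking = np.argsort(gains)[::-1]              # most to least relevant

# Thesis-style filter (threshold rule assumed): keep features whose IG
# is at least mean(IG) minus half a standard deviation.
gains_arr = np.array(gains)
keep = np.where(gains_arr >= gains_arr.mean() - 0.5 * gains_arr.std())[0]
```

On this toy data the shifted feature tops the ranking, mirroring the thesis's finding that glucose dominates; the same ranking step is what the abstract refers to as sorting features from most to least important.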
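The wrapper approach named in the abstract is forward selection around a classifier. A greedy sketch with a 5-NN classifier scored by leave-one-out accuracy follows; the stopping rule, distance metric, and synthetic data are assumptions, since the abstract does not specify them:

```python
import numpy as np

def knn_accuracy(X, y, k=5):
    """Leave-one-out accuracy of a k-NN classifier (Euclidean distance)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)            # leave-one-out: exclude the point itself
    correct = 0
    for i in range(len(y)):
        neighbors = np.argsort(d[i])[:k]
        votes = np.bincount(y[neighbors], minlength=2)
        correct += int(votes.argmax() == y[i])
    return correct / len(y)

def forward_selection(X, y, k=5):
    """Greedy wrapper: repeatedly add the feature that most improves
    k-NN accuracy; stop when no addition helps."""
    remaining = list(range(X.shape[1]))
    selected, best_acc = [], 0.0
    while remaining:
        scored = [(knn_accuracy(X[:, selected + [j]], y, k), j) for j in remaining]
        acc, j = max(scored)
        if acc <= best_acc:                # no strict improvement -> stop
            break
        selected.append(j)
        remaining.remove(j)
        best_acc = acc
    return selected, best_acc

rng = np.random.default_rng(1)
y = rng.integers(0, 2, 120)
X = rng.normal(size=(120, 4))
X[:, 2] += 1.5 * y                         # feature 2 plays the "glucose" role
subset, acc = forward_selection(X, y)
```

Each round retrains and re-scores the classifier for every candidate feature, which is why the abstract notes that the IG-based filter is a computationally cheaper substitute: the filter scores each feature once, while the wrapper's cost grows with the number of rounds times the remaining candidates.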
collection NDLTD
language en_US
format Others
sources NDLTD
author2 Shaou-Gang Miaou
author_facet Shaou-Gang Miaou
Yugowati Praharsi
游華英
author Yugowati Praharsi
游華英
spellingShingle Yugowati Praharsi
游華英
Supervised Learning Approaches and Feature Selection - A Case Study in Diabetes
author_sort Yugowati Praharsi
title Supervised Learning Approaches and Feature Selection - A Case Study in Diabetes
title_short Supervised Learning Approaches and Feature Selection - A Case Study in Diabetes
title_full Supervised Learning Approaches and Feature Selection - A Case Study in Diabetes
title_fullStr Supervised Learning Approaches and Feature Selection - A Case Study in Diabetes
title_full_unstemmed Supervised Learning Approaches and Feature Selection - A Case Study in Diabetes
title_sort supervised learning approaches and feature selection - a case study in diabetes
publishDate 2009
url http://ndltd.ncl.edu.tw/handle/17100265072485850834
work_keys_str_mv AT yugowatipraharsi supervisedlearningapproachesandfeatureselectionacasestudyindiabetes
AT yóuhuáyīng supervisedlearningapproachesandfeatureselectionacasestudyindiabetes
AT yugowatipraharsi yòngyúzhēncètángniàobìngdejiāndūshìxuéxífǎhétèzhēngxuǎnqǔ
AT yóuhuáyīng yòngyúzhēncètángniàobìngdejiāndūshìxuéxífǎhétèzhēngxuǎnqǔ
_version_ 1716852185301516288