Comparative Data Analytic Approach for Detection of Diabetes
Main Author: | Sood, Radhika |
---|---|
Language: | English |
Published: | University of Cincinnati / OhioLINK, 2018 |
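Subjects: | Information Technology; Data mining; Diabetes; Clustering; K-fold cross-validation; Imbalanced data; Decision-support tool |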
Online Access: | http://rave.ohiolink.edu/etdc/view?acc_num=ucin1544100930937728 |
id |
ndltd-OhioLink-oai-etd.ohiolink.edu-ucin1544100930937728 |
---|---|
record_format |
oai_dc |
spelling |
ndltd-OhioLink-oai-etd.ohiolink.edu-ucin1544100930937728 2021-08-03T07:09:15Z Comparative Data Analytic Approach for Detection of Diabetes Sood, Radhika Information Technology Data mining Diabetes Clustering K-fold cross-validation Imbalanced data Decision-support tool Data science methods have the potential to benefit other scientific fields by shedding new light on common questions. One such application is making predictions from medical data. Diabetes is one of the oldest diseases known to mankind, yet despite the tremendous scientific advances witnessed in the century, medical science cannot claim to know everything there is to know about the disease. Prevention is therefore better than cure, and the prevention of diabetes is an active research topic in the healthcare community. The purpose of this study is to create a framework for detecting the existence of diabetes using the attributes available in the PimaIndiansDiabetes2 dataset. After preliminary training of various models, we decided, based on the results, to carry out a comparative study of logistic regression (LR), random forest (RF), and support vector machine (SVM) models using five-fold cross-validation. The data are imbalanced, so the Synthetic Minority Over-sampling Technique (SMOTE) is used to address this; SMOTE helps in situations where one class dominates the other. The study contributes to the medical world as well as to the data analytics world. The model that proves most reliable can be used by medical specialists to determine the risk and existence of diabetes in targeted patients. The selected model can also be used in future to study similar data without a prior comparative analysis, saving time and resources.
This model can serve as a decision-support tool for medical practitioners, which will not only expedite the decision-making process but also help reduce the cost of services by cutting down on time-consuming processes. The study uses previously collected data that is available for research and study. The research focuses on the influence of diabetes on the Pima Indians, an American population. Female Pima Indians were tested for diabetes in accordance with World Health Organization criteria. The data belong to the National Institute of Diabetes and Digestive and Kidney Diseases and are part of the UCI database. In the first phase of the study, data illustration and variable identification are carried out. In the second phase, the three chosen models (logistic regression, SVM, and RF, selected by trial and error across multiple models) are compared on sensitivity, specificity, and AUC (area under the curve). In the third phase, the best model is determined from these results. In the fourth phase, variable importance is determined to understand which variable contributes most to the persistence of diabetes. In the last phase, k-means clustering is used to assess the performance of the model and thereby establish confidence in it. The sections that follow explain these techniques in detail in the context of the dataset under study. 2018 English text University of Cincinnati / OhioLINK http://rave.ohiolink.edu/etdc/view?acc_num=ucin1544100930937728 unrestricted This thesis or dissertation is protected by copyright: all rights reserved. It may not be copied or redistributed beyond the terms of applicable copyright laws. |
collection |
NDLTD |
language |
English |
sources |
NDLTD |
topic |
Information Technology Data mining Diabetes Clustering K-fold cross-validation Imbalanced data Decision-support tool |
author |
Sood, Radhika |
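
The abstract in this record says the three models are compared on sensitivity, specificity, and AUC. As a rough illustration of what those three numbers measure, the sketch below computes them in plain Python. It is not the thesis's own code: the function names and the toy labels and scores are hypothetical, chosen only for this example, and the AUC is computed by the simple pairwise (rank-based) definition rather than by any particular library routine.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(y_true: Sequence[int],
                            y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (sensitivity, specificity) for binary labels coded 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative case")
    # Sensitivity: share of true positives that were detected.
    # Specificity: share of true negatives that were correctly cleared.
    return tp / (tp + fn), tn / (tn + fp)


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive case receives
    a higher score than a randomly chosen negative case (ties count 0.5).
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Hypothetical toy data: 1 = diabetic, 0 = non-diabetic.
    labels = [1, 0, 1, 1, 0, 0, 1, 0]
    scores = [0.9, 0.2, 0.65, 0.4, 0.5, 0.1, 0.8, 0.3]
    # Turn scores into class predictions with an assumed 0.5 threshold.
    preds = [1 if s >= 0.5 else 0 for s in scores]
    print(sensitivity_specificity(labels, preds))
    print(auc(labels, scores))
```

Sensitivity and specificity depend on the chosen threshold (0.5 here is an assumption), whereas AUC summarises ranking quality across all thresholds, which is one reason a comparison might report all three.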