Model selection, union and assembling in practical data analysis : methods and case study

The main problem in KDD (Knowledge Discovery and Data Mining) is always two-fold: we have to discover knowledge in real data and we need to develop methods for KDD. This thesis is also two-fold. First, I participated in the support and maintenance of the project ‘Personality traits and drug consumpt...


Bibliographic Details
Main Author: Muhammad, Awaz K.
Other Authors: Gorban, Alexander ; Mirkes, Evgeny
Published: University of Leicester 2018
Subjects:
510
Online Access: https://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.755332
id ndltd-bl.uk-oai-ethos.bl.uk-755332
record_format oai_dc
spelling ndltd-bl.uk-oai-ethos.bl.uk-755332 ; 2019-03-05T16:03:12Z ; http://hdl.handle.net/2381/42783 ; Electronic Thesis or Dissertation
collection NDLTD
sources NDLTD
topic 510
description The main problem in KDD (Knowledge Discovery and Data Mining) is always two-fold: we have to discover knowledge in real data and we need to develop methods for KDD. This thesis is also two-fold. First, I participated in the support and maintenance of the project ‘Personality traits and drug consumption’. Real data from almost 2000 respondents were analysed. My role was in data analysis and risk assessment. The central problem is the search for, and validation of, psychological predictors of the consumption of different drugs. Eight data mining algorithms were used for user/nonuser classification: decision trees, random forests, k-nearest neighbours, linear discriminant analysis, Gaussian mixtures, probability density function estimation by radial basis functions, logistic regression, and naïve Bayes. Correlation analysis based on Pearson’s correlation coefficient and on relative information gain revealed groups of drugs with strongly correlated consumption. Three correlation pleiades were identified. Classifiers with sensitivity and specificity greater than 70% were obtained for almost all classification tasks. Secondly, several new methods and approaches to feature selection were proposed and tested on the drug consumption database and on several other publicly available databases. These methods include ‘double Kaiser selection’ for selecting the main factors (principal components) and main attributes. Considering each attribute as a distribution on factors allowed us to apply any Kaiser rule to feature selection as well. We developed a methodology for the creation and utilisation of controllable multicollinearity. Multicollinearity can be useful because it allows one to correct mistakes in data and to estimate missing data. It is undesirable because many statistical tasks become ill-conditioned. The alternative attribute sets approach (AASA) can determine several sets of relevant attributes that can be used separately to solve the original problem. We tested AASA on several classification problems. We demonstrated that this methodology could be more accurate than the best traditional feature selection methods.
author2 Gorban, Alexander ; Mirkes, Evgeny
author Muhammad, Awaz K.
title Model selection, union and assembling in practical data analysis : methods and case study
publisher University of Leicester
publishDate 2018
url https://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.755332