Model selection, union and assembling in practical data analysis : methods and case study

The main problem in KDD (Knowledge Discovery and Data Mining) is always two-fold: we have to discover knowledge in real data and we need to develop methods for KDD. This thesis is also two-fold. First, I participated in the support and maintenance of the project ‘Personality traits and drug consumpt...

Bibliographic Details
Main Author:	Muhammad, Awaz K.
Other Authors:	Gorban, Alexander ; Mirkes, Evgeny
Published:	University of Leicester 2018
Subjects:	510
Online Access:	https://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.755332

id	ndltd-bl.uk-oai-ethos.bl.uk-755332
record_format	oai_dc
spelling	ndltd-bl.uk-oai-ethos.bl.uk-755332 2019-03-05T16:03:12Z Model selection, union and assembling in practical data analysis : methods and case study Muhammad, Awaz K. Gorban, Alexander ; Mirkes, Evgeny 2018 The main problem in KDD (Knowledge Discovery and Data Mining) is always two-fold: we have to discover knowledge in real data and we need to develop methods for KDD. This thesis is also two-fold. First, I participated in the support and maintenance of the project ‘Personality traits and drug consumption’. Real data from almost 2000 respondents were analysed. My role was in data analysis and risk assessment. The central problem is the search for, and validation of, psychological predictors of the consumption of different drugs. Eight data mining algorithms were used for user/nonuser classification: decision trees, random forests, k-nearest neighbours, linear discriminant analysis, Gaussian mixtures, probability density function estimation by radial basis functions, logistic regression, and naïve Bayes. Correlation analysis based on Pearson’s correlation coefficient and on relative information gain revealed groups of drugs with strongly correlated consumption. Three correlation pleiades were identified. Classifiers with sensitivity and specificity greater than 70% were obtained for almost all classification tasks. Secondly, several new methods and approaches to feature selection were proposed and tested on the drug consumption database and on several other publicly available databases. These methods include ‘double Kaiser selection’ for selection of the main factors (principal components) and main attributes. Considering each attribute as a distribution on factors allowed us to apply any Kaiser rule to feature selection as well. We developed a methodology for the creation and utilisation of controllable multicollinearity. Multicollinearity can be useful because it allows mistakes in data to be corrected and missing data to be estimated. It is undesirable because many statistical tasks become ill-conditioned. The alternative attribute sets approach (AASA) can determine several sets of relevant attributes that can be used separately to solve the original problems. We tested AASA on several classification problems. We demonstrated that this methodology could be more accurate than the best traditional feature selection methods. 510 University of Leicester https://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.755332 http://hdl.handle.net/2381/42783 Electronic Thesis or Dissertation
collection	NDLTD
sources	NDLTD
topic	510
spellingShingle	510 Muhammad, Awaz K. Model selection, union and assembling in practical data analysis : methods and case study
description	The main problem in KDD (Knowledge Discovery and Data Mining) is always two-fold: we have to discover knowledge in real data and we need to develop methods for KDD. This thesis is also two-fold. First, I participated in the support and maintenance of the project ‘Personality traits and drug consumption’. Real data from almost 2000 respondents were analysed. My role was in data analysis and risk assessment. The central problem is the search for, and validation of, psychological predictors of the consumption of different drugs. Eight data mining algorithms were used for user/nonuser classification: decision trees, random forests, k-nearest neighbours, linear discriminant analysis, Gaussian mixtures, probability density function estimation by radial basis functions, logistic regression, and naïve Bayes. Correlation analysis based on Pearson’s correlation coefficient and on relative information gain revealed groups of drugs with strongly correlated consumption. Three correlation pleiades were identified. Classifiers with sensitivity and specificity greater than 70% were obtained for almost all classification tasks. Secondly, several new methods and approaches to feature selection were proposed and tested on the drug consumption database and on several other publicly available databases. These methods include ‘double Kaiser selection’ for selection of the main factors (principal components) and main attributes. Considering each attribute as a distribution on factors allowed us to apply any Kaiser rule to feature selection as well. We developed a methodology for the creation and utilisation of controllable multicollinearity. Multicollinearity can be useful because it allows mistakes in data to be corrected and missing data to be estimated. It is undesirable because many statistical tasks become ill-conditioned. The alternative attribute sets approach (AASA) can determine several sets of relevant attributes that can be used separately to solve the original problems. We tested AASA on several classification problems. We demonstrated that this methodology could be more accurate than the best traditional feature selection methods.
author2	Gorban, Alexander ; Mirkes, Evgeny
author_facet	Gorban, Alexander ; Mirkes, Evgeny Muhammad, Awaz K.
author	Muhammad, Awaz K.
author_sort	Muhammad, Awaz K.
title	Model selection, union and assembling in practical data analysis : methods and case study
title_short	Model selection, union and assembling in practical data analysis : methods and case study
title_full	Model selection, union and assembling in practical data analysis : methods and case study
title_fullStr	Model selection, union and assembling in practical data analysis : methods and case study
title_full_unstemmed	Model selection, union and assembling in practical data analysis : methods and case study
title_sort	model selection, union and assembling in practical data analysis : methods and case study
publisher	University of Leicester
publishDate	2018
url	https://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.755332
work_keys_str_mv	AT muhammadawazk modelselectionunionandassemblinginpracticaldataanalysismethodsandcasestudy
_version_	1718999932956835840

Similar Items