Model selection, union and assembling in practical data analysis : methods and case study

Bibliographic Details
Main Author: Muhammad, Awaz K.
Other Authors: Gorban, Alexander ; Mirkes, Evgeny
Published: University of Leicester 2018
Subjects:
510
Online Access: https://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.755332
Description
Summary: The main problem in KDD (Knowledge Discovery and Data Mining) is always two-fold: we have to discover knowledge in real data and we need to develop methods for KDD. This thesis is also two-fold. First, I participated in the support and maintenance of the project ‘Personality traits and drug consumption’. Real data from almost 2000 respondents were analysed. My role was in data analysis and risk assessment. The central problem is the search for and validation of psychological predictors of the consumption of different drugs. Eight data mining algorithms were used for user/nonuser classification: decision trees, random forests, k-nearest neighbours, linear discriminant analysis, Gaussian mixtures, probability density function estimation by radial basis functions, logistic regression, and naïve Bayes. Correlation analysis based on Pearson’s correlation coefficient and on relative information gain revealed the existence of groups of drugs with strongly correlated consumption. Three correlation pleiades were identified. Classifiers with sensitivity and specificity greater than 70% were obtained for almost all classification tasks. Secondly, several new methods and approaches to feature selection were proposed and tested on the drug consumption database and on several other publicly available databases. These methods include ‘double Kaiser selection’ for selection of the main factors (principal components) and main attributes. Consideration of each attribute as a distribution on factors allowed us to apply any Kaiser rule for feature selection as well. We developed a methodology for the creation and utilisation of controllable multicollinearity. Multicollinearity can be useful because it makes it possible to correct mistakes in data and to estimate missing data. It is undesirable because many statistical tasks become ill-conditioned.
The alternative attribute sets approach (AASA) can determine several sets of relevant attributes, each of which can be used separately to solve the original problem. We tested AASA on several classification problems. We demonstrated that this methodology could be more accurate than the best traditional feature selection methods.
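
For readers unfamiliar with the Kaiser rule mentioned in the summary, the following is a minimal sketch of the classical rule only, not of the thesis's ‘double Kaiser selection’, whose details are not given in this record. It assumes NumPy is available; the function name `kaiser_components` and the toy data are hypothetical. The rule retains the principal components whose eigenvalue exceeds the mean eigenvalue, which equals 1 when the components are computed from a correlation matrix.

```python
import numpy as np


def kaiser_components(X):
    """Apply the classical Kaiser rule to a data matrix.

    X has shape (n_samples, n_features). Returns the eigenvalues of the
    correlation matrix in descending order and the number of components
    whose eigenvalue exceeds the mean eigenvalue.
    """
    X = np.asarray(X, dtype=float)
    # Correlation matrix between columns (attributes).
    corr = np.corrcoef(X, rowvar=False)
    # eigvalsh returns ascending eigenvalues of a symmetric matrix.
    eigenvalues = np.linalg.eigvalsh(corr)[::-1]
    # For a correlation matrix the mean eigenvalue is 1.
    threshold = eigenvalues.mean()
    n_kept = int(np.sum(eigenvalues > threshold))
    return eigenvalues, n_kept


# Hypothetical toy data: the second column is exactly twice the first,
# so the two columns are perfectly collinear.
toy = [[1, 2, 0], [2, 4, 1], [3, 6, 0], [4, 8, 1]]
eigenvalues, n_kept = kaiser_components(toy)
print(np.round(eigenvalues, 3), n_kept)
```

In this toy example the collinear pair of columns contributes one near-zero eigenvalue, which illustrates the ill-conditioning the summary attributes to multicollinearity.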