Empirical evaluation of optimization techniques for classification and prediction tasks


Bibliographic Details
Main Author: Leke, Collins Achepsah
Published: 2014
Online Access: http://hdl.handle.net/10210/9858
Description
Summary: M.Ing. (Electrical and Electronic Engineering). Missing data causes a variety of problems in the analysis and processing of datasets in almost every aspect of day-to-day life. For this reason, missing data and ways of handling it have recently been an area of research in a variety of disciplines. This thesis presents a method for approximating missing values in a dataset by combining auto-associative neural networks with the Genetic Algorithm (GA), Simulated Annealing (SA), Particle Swarm Optimization (PSO), Random Forest (RF), and Negative Selection (NS) algorithms, and provides a comparative analysis of these algorithms. The suggested methods use the optimization algorithms to minimize an error function derived from training an auto-associative neural network, during which the interrelationships between the inputs and the outputs are obtained and stored in the weights connecting the different layers of the network. The error function is expressed as the square of the difference between the actual observations and the values predicted by the auto-associative neural network. When data are missing, not all values of the actual observations are known; hence, the error function is decomposed to depend on the known and unknown variable values. The neural networks are Multi-Layer Perceptrons (MLPs) trained using the Scaled Conjugate Gradient (SCG) method. The research primarily focuses on predicting missing data entries in two datasets: the Manufacturing dataset and the Forest Fire dataset. Here, prediction is a representation of how things will occur in the future based on past occurrences and experiences.
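As an illustration of the error-minimization idea described above, the following is a minimal Python sketch and not the thesis's implementation. It assumes a toy, hand-written auto-associative map (`toy_autoassociative_net`, a hypothetical stand-in for a trained MLP) and uses a basic simulated annealing loop to search over the unknown entries of a record so that the squared difference between the record and the network's output is minimized. All function names, the step size, and the cooling schedule are assumptions chosen for illustration.

```python
import math
import random


def toy_autoassociative_net(x):
    """Hypothetical stand-in for a trained auto-associative network.

    It reproduces exactly any record lying on the line (t, 2t, t + 1).
    """
    h = (x[0] + x[1] / 2.0 + (x[2] - 1.0)) / 3.0  # single bottleneck unit
    return [h, 2.0 * h, h + 1.0]


def reconstruction_error(record, net):
    """Squared difference between a record and the network's output."""
    output = net(record)
    return sum((xi - oi) ** 2 for xi, oi in zip(record, output))


def fill(record_with_gaps, guesses):
    """Replace each None (missing entry) with the next guessed value."""
    it = iter(guesses)
    return [next(it) if v is None else v for v in record_with_gaps]


def anneal_missing(record_with_gaps, net, rng, n_steps=2000, step=0.5, t0=1.0):
    """Search over the unknown entries only; known entries stay fixed."""
    n_missing = sum(v is None for v in record_with_gaps)
    current = [0.0] * n_missing
    current_err = reconstruction_error(fill(record_with_gaps, current), net)
    best, best_err = list(current), current_err
    for k in range(n_steps):
        # Linear cooling schedule (an assumption, not taken from the thesis).
        temperature = t0 * (1.0 - k / n_steps) + 1e-9
        candidate = [g + rng.gauss(0.0, step) for g in current]
        cand_err = reconstruction_error(fill(record_with_gaps, candidate), net)
        # Always accept improvements; accept worse moves with a
        # probability that shrinks as the temperature falls.
        accept = cand_err < current_err or rng.random() < math.exp(
            (current_err - cand_err) / temperature
        )
        if accept:
            current, current_err = candidate, cand_err
            if current_err < best_err:
                best, best_err = list(current), current_err
    return best, best_err


# Record with the second entry missing; the consistent value is 3.0.
record = [1.5, None, 2.5]
estimate, err = anneal_missing(record, toy_autoassociative_net, random.Random(0))
```

In this sketch the error depends on both the known entries (held fixed) and the unknown entries (searched over), which mirrors the decomposition described in the summary. Any of the other optimizers named there could in principle replace the annealing loop.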
The research also investigates the use of the proposed technique in accurately approximating and classifying missing data on five classification datasets: the Australian Credit, German Credit, Japanese Credit, Heart Disease, and Car Evaluation datasets. It further investigates the impact of different neural network architectures on training and on the approximations found for the missing values, and uses the best architecture found for evaluation purposes. On the Manufacturing dataset, the approximated values obtained with the proposed models are accurate: the correlation between the actual missing values and the corresponding approximated values ranges between 94.7% and 95.2%, with the exception of the Negative Selection algorithm, which resulted in a correlation of 49.6%. On the Forest Fire dataset, the correlation between the actual missing values and the corresponding approximated values was low, in the range 0.95% to 4.49%, due to the nature of the values of the variables in the dataset. On this dataset, the Negative Selection algorithm produced a negative percentage correlation between the actual and approximated values, with a value of 100%. The approximations found for missing data are also observed to depend on the particular neural network architecture employed in training. Further analysis revealed that the Random Forest algorithm on average performed better than the GA, SA, PSO, and NS algorithms, yielding the lowest Mean Square Error, Root Mean Square Error, and Mean Absolute Error values. At the other end of the scale was the NS algorithm, which produced the highest values for all three error metrics; for these metrics, lower values indicate better performance.
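For readers unfamiliar with the metrics named above, the following is a small Python sketch of how they are commonly computed. It assumes the reported correlation is the Pearson correlation coefficient expressed as a percentage, which the summary does not state explicitly. The function names and the sample values are illustrative only and are not data from the thesis.

```python
import math


def mean_squared_error(actual, predicted):
    """Average of the squared differences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def root_mean_squared_error(actual, predicted):
    """Square root of the mean squared error."""
    return math.sqrt(mean_squared_error(actual, predicted))


def mean_absolute_error(actual, predicted):
    """Average of the absolute differences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def correlation_percent(actual, predicted):
    """Pearson correlation coefficient, scaled to a percentage (assumed)."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return 100.0 * cov / math.sqrt(var_a * var_p)


# Illustrative values only: true missing entries and their approximations.
actual = [1.0, 2.0, 3.0, 4.0]
approximated = [1.5, 2.5, 2.5, 4.5]

mse = mean_squared_error(actual, approximated)
rmse = root_mean_squared_error(actual, approximated)
mae = mean_absolute_error(actual, approximated)
corr = correlation_percent(actual, approximated)
```

Lower MSE, RMSE, and MAE indicate better approximations, whereas a correlation closer to 100% indicates closer agreement between actual and approximated values.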
The evaluation on the classification datasets revealed that the Random Forest algorithm was the most accurate at identifying the category to which a new observation belongs on the basis of the training data, yielding the highest AUC percentage values on all five classification datasets. The differences between its AUC values and those of the GA, SA, PSO, and NS algorithms were statistically significant, with the most significant differences observed against the Negative Selection algorithm on all five datasets. The AUC values of the GA, SA, and PSO algorithms did not differ greatly from one another on any of the five datasets. Overall, the results show that the algorithm that performed best on both the prediction and classification problems was the Random Forest algorithm. At the other end of the scale was the Negative Selection algorithm, which produced the highest error metric values for the prediction problems and the lowest AUC values for the classification problems.
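As a rough illustration of the AUC metric used in the comparison above, here is a short Python sketch. It assumes AUC refers to the area under the ROC curve for a binary classifier, which the summary does not spell out, and computes it through the equivalent pairwise-ranking formulation. The function name and the sample labels and scores are hypothetical.

```python
def auc_from_scores(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct ranking. This equals the area under
    the ROC curve for binary labels coded as 1 (positive) and 0 (negative).
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Illustrative labels and classifier scores only.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = auc_from_scores(labels, scores)
```

A value of 1.0 means every positive example is scored above every negative one, and 0.5 corresponds to random ranking. Multiplying by 100 gives a percentage form such as the one reported in the summary.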