Summary: | Machine learning algorithms often face data-related problems. Real-world datasets vary in type and dimensionality, each of which poses its own difficulties; moreover, they often contain irrelevant or noisy features. Consequently, different data-related problems call for different classification techniques. In this paper, several data-related problems of interest are replicated in synthetic datasets in order to investigate and evaluate the performance of a range of learning algorithms. Specifically, the data problems studied are: datasets with varying inter-class distances (classes are separated by different amounts); datasets whose classes have different input relevance; datasets whose classes are defined by multiple features and by multiple underlying patterns; datasets with an increasing number of noisy features; and datasets with noisy features of varying amplitude. Datasets combining several of these problems were also synthesized. These datasets were then used to measure and validate the performance of a number of selected classification algorithms. The experimental results show that GNG performed best on the datasets with varying inter-class distances, while DL performed best on the datasets representing the other data problems.
|
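As an illustration of the kind of synthetic data the summary describes, the following is a minimal sketch and not the authors' generator. It assumes two classes, one informative Gaussian feature whose class means are separated by `distance`, and uniformly distributed noisy features scaled by `noise_amplitude`. The function name `make_synthetic` and all of its parameters are hypothetical.

```python
import random


def make_synthetic(n_per_class=200, distance=2.0, n_noisy=5,
                   noise_amplitude=1.0, seed=0):
    """Return (rows, labels) for a two-class synthetic dataset.

    Assumptions (not taken from the paper): one informative feature drawn
    from a unit-variance Gaussian centred at 0 for class 0 and at
    `distance` for class 1, plus `n_noisy` irrelevant features drawn
    uniformly from [-noise_amplitude, +noise_amplitude].
    """
    rng = random.Random(seed)
    rows, labels = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            # Informative feature: class mean shifts by `distance`.
            informative = rng.gauss(label * distance, 1.0)
            # Noisy features: carry no information about the class label.
            noise = [noise_amplitude * rng.uniform(-1.0, 1.0)
                     for _ in range(n_noisy)]
            rows.append([informative] + noise)
            labels.append(label)
    return rows, labels


if __name__ == "__main__":
    # Sweep the inter-class distance while holding the noise fixed.
    for d in (0.5, 1.0, 2.0, 4.0):
        X, y = make_synthetic(distance=d, n_noisy=3)
        print(d, len(X), len(X[0]))
```

Sweeping `distance`, `n_noisy`, or `noise_amplitude` one at a time gives a family of datasets in which a single data problem varies while the others stay fixed, which is the general experimental pattern the summary outlines.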