Summary: | Bayesian networks are an increasingly important area of research and have been proposed for real-world applications such as medical diagnosis, image recognition, and fraud detection. In all of these applications, accuracy alone is not sufficient, because errors incur costs. Hence, this thesis develops new algorithms, referred to as cost-sensitive Bayesian network algorithms, that aim to minimise the expected cost of misclassification. The study reviews existing research on cost-sensitive learning and identifies three common methods for developing cost-sensitive algorithms for decision tree learning. These methods are then used to develop three different algorithms for learning cost-sensitive Bayesian networks: (i) an indirect method, in which costs are included by changing the data distribution without changing a cost-insensitive algorithm; (ii) a direct method, in which an existing cost-insensitive algorithm is altered to take account of cost; and (iii) an evolutionary method, in which genetic algorithms are used to evolve cost-sensitive Bayesian networks. The new algorithms are evaluated on 36 benchmark datasets and compared with existing cost-sensitive algorithms, namely MetaCost+J48 and MetaCost+BN, as well as an existing cost-insensitive Bayesian network algorithm. The results show improvements over the other algorithms in terms of cost, whilst still maintaining accuracy. In our experimental methodology, every experiment is repeated over 10 random trials, and in each trial the data are divided into 75% for training and 25% for testing.
The results show that: (i) all three new algorithms perform better than the cost-insensitive Bayesian learning algorithm on all 36 datasets in terms of cost; (ii) the new algorithms based on the indirect method, the direct method, and genetic algorithms perform better than MetaCost+J48 on 29, 28, and 31 of the 36 datasets respectively in terms of cost; (iii) the algorithm that utilises the indirect method performs well on imbalanced data, doing better than our other two algorithms on 8 of the 36 datasets in terms of cost; (iv) the algorithm based on the direct method outperforms the other new algorithms on 13 of the 36 datasets in terms of cost; (v) the evolutionary version of the algorithm is better than the other algorithms, including those using the direct and indirect methods, on 24 of the 36 datasets in terms of both cost and accuracy; (vi) all three new algorithms perform better than MetaCost+BN on all 36 datasets in terms of cost.
|
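The cost-based evaluation described in the summary can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function names, the two-class cost matrix `COST`, and the seed are all hypothetical, and the sketch only assumes the standard minimum-expected-cost decision rule together with the stated protocol of 10 random trials with a 75%/25% split.

```python
import random


def min_expected_cost_class(probs, cost):
    """Return the class index with the lowest expected misclassification cost.

    probs[j]   : estimated probability that the true class is j
    cost[i][j] : cost of predicting class i when the true class is j
    """
    n = len(probs)
    expected = [sum(cost[i][j] * probs[j] for j in range(n)) for i in range(n)]
    return min(range(n), key=expected.__getitem__)


def total_cost(predicted, actual, cost):
    """Sum the cost of every (predicted, actual) pair on a test set."""
    return sum(cost[p][a] for p, a in zip(predicted, actual))


def split_75_25(rows, rng):
    """Shuffle a copy of the rows and split it 75% training / 25% testing."""
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(0.75 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


def trial_splits(rows, n_trials=10, seed=0):
    """Produce one random train/test split per trial (seed is an assumption)."""
    rng = random.Random(seed)
    return [split_75_25(rows, rng) for _ in range(n_trials)]


# Hypothetical cost matrix: predicting class 0 when the truth is class 1
# costs 10, while the opposite error costs 1.
COST = [[0, 10],
        [1, 0]]

# Class 0 is more probable, but predicting class 1 has lower expected cost
# (0.8 versus 2.0), so a cost-sensitive classifier chooses class 1.
choice = min_expected_cost_class([0.8, 0.2], COST)
```

With these illustrative costs, the most probable class and the minimum-cost class differ, which is why accuracy alone is not a sufficient criterion when errors carry unequal costs.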