Summary: | Master's === National Chiao Tung University === EECS International Graduate Program === 105 === Big data analytics is the process of examining large data sets that contain a variety of data types to uncover hidden patterns, unknown correlations, market trends, customer preferences and other useful business information. The healthcare industry is generally data rich, and predicting a new patient's disease risk from patients' history data is an important research issue. Related studies have used data mining techniques, such as decision trees, clustering and association rules, and recommender systems, such as user-based collaborative filtering (CF), to predict the future disease risk of new patients. However, decision trees are unstable: a small change in the input data can produce a large change in the tree, which gives poor accuracy on large data sets. Clustering methods such as k-means require the number of clusters to be known in advance and do not work well with clusters of different sizes and densities, so it is difficult to use them to predict future disease risk for large data sets. Association rules require that relationships exist among the items in the data set, so they may not be applicable to all data sets. User-based CF, such as CARE, gives poor accuracy when there is a large variation in patients' diseases. A representative related work, CFIAC, which is based on item-based CF, cannot deal with the sparsity problem because it does not use any pre-processing method to remove data that contribute little to the prediction. In this thesis, we propose an effective disease risk prediction system (EDRP) that combines distribution-based clustering with item-based CF. The system is feasible for large data sets and performs well at capturing the future disease risk of new patients. Experimental results show that the proposed EDRP increases coverage by 15.32% and 24.08% and accuracy by 19.56% and 32.76%, compared to CARE and CFAIC, respectively.
|
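The item-based CF idea mentioned in the summary can be illustrated with a minimal sketch. This is not the thesis's EDRP implementation: the function names, the toy patient-by-disease matrix, and the choice of cosine similarity are all assumptions made for illustration, and the distribution-based clustering step is omitted entirely.

```python
import math

def item_similarity(matrix):
    """Cosine similarity between the disease columns of a binary
    patient x disease matrix (hypothetical helper, not from the thesis)."""
    n_items = len(matrix[0])
    cols = [[row[j] for row in matrix] for j in range(n_items)]
    norms = [math.sqrt(sum(v * v for v in col)) for col in cols]
    sim = [[0.0] * n_items for _ in range(n_items)]
    for a in range(n_items):
        for b in range(n_items):
            if a != b and norms[a] > 0 and norms[b] > 0:
                dot = sum(x * y for x, y in zip(cols[a], cols[b]))
                sim[a][b] = dot / (norms[a] * norms[b])
    return sim

def predict_risk(history, sim):
    """Score each disease the patient does not yet have by summing its
    similarity to the diseases already in the patient's history."""
    scores = {}
    for j, has_disease in enumerate(history):
        if has_disease:
            continue
        scores[j] = sum(sim[j][k] for k, h in enumerate(history) if h)
    return scores

# Toy data (assumed): rows are patients, columns are diseases 0, 1, 2.
patients = [
    [1, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
    [1, 0, 1],
]
sim = item_similarity(patients)
new_patient = [1, 0, 0]  # the new patient has only disease 0 so far
scores = predict_risk(new_patient, sim)
ranked = sorted(scores, key=scores.get, reverse=True)
```

Here `ranked` lists the diseases the new patient does not yet have, ordered by estimated risk. In a system along the lines described in the summary, similarities would presumably be computed within a cluster of similar patients rather than over the whole data set, which is one way to reduce the effect of sparse data.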