Summary: | Master's === National Taiwan University === Graduate Institute of Computer Science and Information Engineering === 89 === This thesis examines the effect of using the chi-square statistic as the similarity measure when clustering transactional data sets. The motivation is to propose a similarity measure that offers more mathematical insight into clustering results than existing similarity measures do. A common problem with existing clustering algorithms is that clustering quality depends heavily on parameters set by the user, which may even include the number of clusters in the output. To tackle this problem, the thesis proposes a similarity measure based on the chi-square statistic. Combined with the complete-link hierarchical clustering algorithm, this similarity measure offers three advantages. First, the user does not need to specify the number of clusters to output; the user only specifies the level of statistical significance beyond which two objects are eligible to be clustered, and the algorithm then determines the number of clusters automatically. Second, each identified cluster has a clear statistical interpretation: the complete-link algorithm guarantees that the similarity between every pair of objects in a cluster exceeds the statistical significance threshold. Third, experimental results indicate that the proposed approach generally achieves better clustering quality than existing algorithms.
|
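The procedure described in the summary can be illustrated with a minimal Python sketch. The abstract does not give the exact form of the chi-square similarity, so everything specific here is an assumption made for illustration: each transaction is treated as a set of item IDs, the similarity of two transactions is the Pearson chi-square statistic of their 2x2 item presence/absence table (counted only when the association is positive), and the hypothetical helpers `chi_square_similarity`, `p_value`, and `complete_link_clusters` are not names taken from the thesis.

```python
import math
from itertools import combinations


def chi_square_similarity(a, b, n_items):
    """Chi-square statistic of the 2x2 presence/absence table of two
    transactions (sets of item IDs) over a universe of n_items items.
    Assumption: negatively associated pairs get similarity 0."""
    n11 = len(a & b)              # items in both transactions
    n10 = len(a - b)              # items only in a
    n01 = len(b - a)              # items only in b
    n00 = n_items - n11 - n10 - n01
    denom = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    if denom == 0 or n11 * n00 <= n10 * n01:
        return 0.0
    return n_items * (n11 * n00 - n10 * n01) ** 2 / denom


def p_value(stat):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))


def complete_link_clusters(transactions, n_items, alpha=0.05):
    """Agglomerative complete-link clustering that stops when no pair of
    clusters is significant at level alpha; the number of clusters is
    therefore an output, not an input."""
    n = len(transactions)
    sim = {
        (i, j): chi_square_similarity(transactions[i], transactions[j], n_items)
        for i, j in combinations(range(n), 2)
    }

    def link(c1, c2):
        # Complete link: similarity of two clusters is their weakest pair.
        return min(sim[(min(i, j), max(i, j))] for i in c1 for j in c2)

    clusters = [[i] for i in range(n)]
    while True:
        best = None
        for x, y in combinations(range(len(clusters)), 2):
            s = link(clusters[x], clusters[y])
            if p_value(s) < alpha and (best is None or s > best[0]):
                best = (s, x, y)
        if best is None:
            return clusters
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]  # y > x, so index x is unaffected


transactions = [
    {0, 1, 2, 3, 4},
    {0, 1, 2, 3, 5},
    {10, 11, 12, 13, 14},
    {10, 11, 12, 13, 15},
]
print(complete_link_clusters(transactions, n_items=20, alpha=0.05))
```

Because a merge is accepted only when the weakest pair across the two clusters is still significant, every pair of objects inside a final cluster passes the threshold, which is the complete-link guarantee stated in the summary.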