Summary: | Master's === National Cheng Kung University === Institute of Medical Informatics === 101 === Because of the huge amount of data on the web, a search often returns many duplicate and near-duplicate results. The motivation of this thesis is to reduce the time users spend filtering out this duplicate and near-duplicate information when searching.
In this thesis, we propose a novel clustering method to solve the near-duplicate problem. Our method transforms each document into a feature vector whose weights are the term frequencies of the corresponding words. To reduce the dimension of these feature vectors, we use principal component analysis (PCA) to project them into another space. After PCA, we use cosine similarity to compute the similarity between each pair of documents. We then use the EM algorithm together with a Neyman-Pearson hypothesis test to cluster the duplicate documents.
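The standard front end of the pipeline described above (term-frequency vectors, PCA-based dimension reduction, then pairwise cosine similarity) can be sketched as follows. This is a minimal illustration using a tiny hypothetical corpus and plain NumPy; it does not reproduce the thesis's EM/Neyman-Pearson clustering step, and all document strings and the choice of 2 retained components are assumptions for the example.

```python
import numpy as np

# Toy corpus: docs 0 and 1 are near-duplicates; doc 2 is unrelated.
# (Hypothetical example sentences, not from the thesis's data set.)
docs = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumped over the lazy dog",
    "machine learning methods for text clustering",
]

# 1. Term-frequency feature vectors over the corpus vocabulary.
vocab = sorted({w for d in docs for w in d.split()})
tf = np.array([[d.split().count(w) for w in vocab] for d in docs],
              dtype=float)

# 2. PCA via SVD on the mean-centered matrix, keeping 2 components.
centered = tf - tf.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
reduced = centered @ vt[:2].T

# 3. Cosine similarity between every pair of reduced vectors.
unit = reduced / np.linalg.norm(reduced, axis=1, keepdims=True)
sim = unit @ unit.T

# The near-duplicate pair should score much higher than the
# unrelated pair, making it a candidate for the same cluster.
print(sim[0, 1] > sim[0, 2])
```

In a real system the similarity matrix `sim` would then feed the clustering stage, where document pairs with high cosine similarity are grouped as duplicates.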
We compared our results with those of the K-means method. The experiments show that our method outperforms K-means.
|