Data Clustering with Complete Must-Link Constraints


Bibliographic Details
Main Author: Maisyatus Suadaa Irfana
Other Authors: Chao-Lung Yang
Format: Others
Language: en_US
Published: 2014
Online Access: http://ndltd.ncl.edu.tw/handle/qrdgn9
Description
Summary: Master's thesis === National Taiwan University of Science and Technology === Department of Industrial Management === 102 === This research aims to develop an integrated method for a special but not uncommon constrained clustering problem defined by Complete Must-Link (CML) constraints. Constrained clustering is a semi-supervised learning approach that incorporates side information, when it is available, to improve the efficiency and purity of clustering. The CML clustering problem can be viewed as aggregating pre-defined data groups. Through the transitive closure process of data aggregation, the data in each group are replaced by their centroid for clustering analysis. This causes an information loss problem: the data distribution, or shape, of the original group is discarded, which matters especially when groups overlap one another. To overcome this problem, this research proposes a new method, named PCA-CML, for the CML constrained clustering problem. Principal component analysis (PCA), which provides supplemental information describing the original partition blocks, is included in the distance matrix of the constrained clustering algorithm when the blocks overlap. A measure called the overlapped ratio is introduced to determine whether CML data partitions overlap. The proposed algorithm is tested on simulated data consisting of overlapped and non-overlapped datasets, and on a real-world dataset containing cartridge quality information. The experimental results indicate that the proposed algorithm can alleviate the information loss problem in CML constrained clustering when the pre-defined CML partitions overlap.
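The information loss described in the summary can be illustrated with a minimal sketch. It is not the thesis's PCA-CML algorithm, whose distance matrix and overlapped ratio are not specified in this record. The function names `block_centroids` and `block_principal_axes` and the simulated blocks are hypothetical, chosen only for illustration. The sketch shows two pre-defined blocks that collapse to the same centroid, while a per-block PCA still tells them apart by their dominant direction.

```python
import numpy as np

def block_centroids(X, block_ids):
    """Replace each pre-defined CML block by its centroid (one row per block)."""
    labels = np.unique(block_ids)
    centroids = np.vstack([X[block_ids == b].mean(axis=0) for b in labels])
    return labels, centroids

def block_principal_axes(X, block_ids):
    """Per block: leading principal axis and its explained-variance ratio (PCA via SVD)."""
    out = {}
    for b in np.unique(block_ids):
        Xb = X[block_ids == b]
        Xc = Xb - Xb.mean(axis=0)                      # centre the block
        _, s, vt = np.linalg.svd(Xc, full_matrices=False)
        var = s ** 2                                   # proportional to component variances
        out[int(b)] = (vt[0], var[0] / var.sum())
    return out

# Hypothetical data: two overlapping blocks with the same centre but different shapes.
rng = np.random.default_rng(0)
a = rng.normal(size=(200, 2)) * [5.0, 0.5]   # elongated along the first coordinate
b = rng.normal(size=(200, 2)) * [0.5, 5.0]   # elongated along the second coordinate
a -= a.mean(axis=0)                          # place both block centres at the origin
b -= b.mean(axis=0)
X = np.vstack([a, b])
block_ids = np.repeat([0, 1], 200)

labels, centroids = block_centroids(X, block_ids)
axes = block_principal_axes(X, block_ids)

print(centroids)                 # both rows are (numerically) the origin
print(axes[0][0], axes[1][0])    # leading axes point in nearly orthogonal directions
```

With centroids alone the two blocks are indistinguishable, whereas the leading principal axes differ. This is the kind of supplemental shape information the summary says PCA contributes; how it enters the distance matrix is left to the full thesis.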