Summary: | Master's thesis === Chang Gung University === Graduate Institute of Information Management === 97 === The mortality rate of breast cancer in Taiwan has been rising year by year, and the disease is a major threat that causes the death of many young women. Current diagnosis of breast cancer is invasive, time-consuming, and costly. This thesis therefore proposes using patients' gene expression data together with data mining techniques to help identify breast cancer subclasses and support related analysis.
In this study, the EM clustering algorithm is applied repeatedly to cluster gene expression data from breast cancer patients under a range of cluster numbers. For each cluster number, every patient sample is transformed into a probability tuple, in which each probability represents the degree to which the sample belongs to one of the clusters. The decision tree induction algorithm J48 is then applied to the transformed data to evaluate performance for each cluster number and thereby find the optimal cluster number.
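The transformation of samples into probability tuples can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: it assumes a Gaussian mixture with diagonal covariance, a deterministic initialisation at evenly spaced samples, and toy two-gene data. The function name `em_membership` and all parameter names are hypothetical. The subsequent J48 evaluation step is not shown.

```python
import math


def em_membership(samples, k, iters=50, var_floor=1e-6):
    """Fit a k-component diagonal Gaussian mixture by EM and return one
    probability tuple (length k) per sample.

    Assumption: means start at evenly spaced samples, a simplification
    chosen here so that the result is reproducible.
    """
    n, d = len(samples), len(samples[0])
    means = [list(samples[(i * n) // k]) for i in range(k)]
    variances = [[1.0] * d for _ in range(k)]
    weights = [1.0 / k] * k
    resp = []
    for _ in range(iters):
        # E-step: posterior probability of each cluster for each sample,
        # computed in log space to avoid underflow.
        resp = []
        for x in samples:
            logp = []
            for c in range(k):
                lp = math.log(weights[c])
                for j in range(d):
                    v = variances[c][j]
                    diff = x[j] - means[c][j]
                    lp -= 0.5 * (math.log(2 * math.pi * v) + diff * diff / v)
                logp.append(lp)
            top = max(logp)
            unnorm = [math.exp(lp - top) for lp in logp]
            total = sum(unnorm)
            resp.append(tuple(u / total for u in unnorm))
        # M-step: re-estimate weights, means, and variances from the
        # membership probabilities.
        for c in range(k):
            nc = sum(r[c] for r in resp) + 1e-12  # guard against empty clusters
            weights[c] = nc / n
            for j in range(d):
                mu = sum(r[c] * x[j] for r, x in zip(resp, samples)) / nc
                var = sum(r[c] * (x[j] - mu) ** 2 for r, x in zip(resp, samples)) / nc
                means[c][j] = mu
                variances[c][j] = max(var, var_floor)
    return resp  # tuples from the final E-step


# Toy expression values for six samples and two genes (made-up numbers).
toy = [(0.1, 0.2), (0.0, -0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]

# One transformed data set per candidate cluster number.
tuples_by_k = {k: em_membership(toy, k) for k in (2, 3)}
```

Each entry of `tuples_by_k` is a data set in which a sample is described by its cluster membership probabilities. In the workflow described above, each such data set would then be passed to a decision tree learner, and the cluster number with the best evaluation result would be kept.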
Using the optimal clustering result, cluster profiles can be derived to analyze the characteristics of breast cancer subclasses. Because the development of a cancer is a continuous process, each cluster profile might represent a typical sub-type of a cancer subclass. In addition, the optimal EM clustering result is exploited to infer candidate genes related to breast cancer and to produce rules that help in understanding and interpreting the identification of breast cancer subclasses.
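One possible reading of the cluster profiles is sketched below. The abstract does not define how profiles or candidate genes are computed, so everything here is an assumption: a profile is taken to be the membership-weighted mean expression of each gene, and genes are ranked by how far their profile values spread across clusters. The function names `cluster_profiles` and `rank_genes`, the gene names, and the numbers are all hypothetical.

```python
def cluster_profiles(samples, memberships):
    """Return one profile per cluster: the membership-weighted mean of each gene.

    Assumes every cluster has nonzero total membership.
    """
    k, d = len(memberships[0]), len(samples[0])
    profiles = []
    for c in range(k):
        mass = sum(m[c] for m in memberships)
        profiles.append([
            sum(m[c] * x[j] for m, x in zip(memberships, samples)) / mass
            for j in range(d)
        ])
    return profiles


def rank_genes(profiles, gene_names):
    """Order genes by the range of their profile values across clusters
    (an illustrative heuristic, not the thesis's inference method)."""
    spread = {
        g: max(p[j] for p in profiles) - min(p[j] for p in profiles)
        for j, g in enumerate(gene_names)
    }
    return sorted(gene_names, key=lambda g: spread[g], reverse=True)


# Made-up expression values for four samples and two genes, with hard
# memberships for readability; soft probability tuples work the same way.
samples = [(1.0, 0.5), (1.2, 0.4), (4.0, 0.6), (4.2, 0.5)]
memberships = [(1.0, 0.0), (1.0, 0.0), (0.0, 1.0), (0.0, 1.0)]

profiles = cluster_profiles(samples, memberships)
ranking = rank_genes(profiles, ["geneA", "geneB"])
```

In this toy case `geneA` differs strongly between the two profiles while `geneB` barely changes, so `geneA` is ranked first.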
|