Data Mining with Decision Trees in the Gene Logic Database : A Breast Cancer Study

Data mining approaches have been increasingly used in recent years to find patterns and regularities in large databases. In this study, the C4.5 decision tree approach was used to mine the Gene Logic database, which contains biological data. The decision tree approach was used in order to ide...


Bibliographic Details
Main Author: Rahpeymai, Neda
Format: Others
Language: English
Published: Högskolan i Skövde, Institutionen för datavetenskap 2002
Subjects:
Online Access: http://urn.kb.se/resolve?urn=urn:nbn:se:his:diva-710
id ndltd-UPSALLA1-oai-DiVA.org-his-710
record_format oai_dc
spelling ndltd-UPSALLA1-oai-DiVA.org-his-710; 2018-01-13T05:13:33Z; Student thesis; info:eu-repo/semantics/bachelorThesis; text; application/postscript; application/pdf; info:eu-repo/semantics/openAccess
collection NDLTD
language English
format Others
sources NDLTD
topic Data mining
Decision trees
C4.5
Breast cancer
Bioinformatics (Computational Biology)
Bioinformatik (beräkningsbiologi)
description Data mining approaches have been increasingly used in recent years to find patterns and regularities in large databases. In this study, the C4.5 decision tree approach was used to mine the Gene Logic database, which contains biological data. The aim was to identify the most relevant genes and risk factors involved in breast cancer, so that healthy patients could be separated from breast cancer patients in the data sets used. Four different tests were performed for this purpose. Cross validation was performed for each of the four tests to evaluate how well the decision trees classify ‘new’ samples. In the first test, the expression of 108 breast-related genes, shown in appendix A, for 75 patients was used as input to the C4.5 algorithm. This test resulted in a decision tree containing only four genes, considered to be the most relevant for correctly classifying patients. Cross validation indicates an average accuracy of 89% in classifying ‘new’ samples. In the second test, risk factor data was used as input. The cross validation result shows an average accuracy of 87% in classifying ‘new’ samples. In the third test, gene expression data and risk factor data were combined into a single input. The cross validation procedure for this approach again indicates an average accuracy of 87% in classifying ‘new’ samples. In the final test, the C4.5 algorithm was used to indicate possible signalling pathways involving the four genes identified by the decision tree based only on gene expression data. In some cases, the C4.5 algorithm found trees suggesting pathways that are supported by the breast cancer literature. Since not all pathways involving the four putative breast cancer genes are known yet, the other suggested pathways should be analyzed further to increase their credibility. In summary, this study demonstrates the application of decision tree approaches to the identification of genes and risk factors relevant for the classification of breast cancer patients.
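The kind of procedure the description outlines (a gain-ratio split of the sort C4.5 uses, evaluated by cross validation) can be sketched roughly as follows. This is an illustrative sketch only, not the thesis's code: it fits a one-level tree rather than a full C4.5 tree with pruning, it omits C4.5's additional average-gain filter, and the data, the function names (entropy, gain_ratio, fit_stump, cross_validate) and the class labels "normal"/"tumour" are all assumptions made up for the example.

```python
import math
import random

def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def gain_ratio(values, labels, threshold):
    """Gain ratio of the binary split `value <= threshold`."""
    left = [y for v, y in zip(values, labels) if v <= threshold]
    right = [y for v, y in zip(values, labels) if v > threshold]
    if not left or not right:
        return 0.0
    wl, wr = len(left) / len(labels), len(right) / len(labels)
    gain = entropy(labels) - wl * entropy(left) - wr * entropy(right)
    split_info = -(wl * math.log2(wl) + wr * math.log2(wr))
    return gain / split_info

def majority(ys, default):
    """Most frequent label in ys (ties broken alphabetically)."""
    return max(sorted(set(ys)), key=ys.count) if ys else default

def fit_stump(rows, labels):
    """Fit a one-level tree on the (feature, threshold) with best gain ratio."""
    best_gr, j, t = 0.0, 0, 0.0
    for col in range(len(rows[0])):
        column = [r[col] for r in rows]
        cuts = sorted(set(column))
        for lo, hi in zip(cuts, cuts[1:]):
            mid = (lo + hi) / 2  # candidate threshold between neighbours
            gr = gain_ratio(column, labels, mid)
            if gr > best_gr:
                best_gr, j, t = gr, col, mid
    overall = majority(labels, None)
    left = majority([y for r, y in zip(rows, labels) if r[j] <= t], overall)
    right = majority([y for r, y in zip(rows, labels) if r[j] > t], overall)
    return lambda r: left if r[j] <= t else right

def cross_validate(rows, labels, k=5, seed=0):
    """Average held-out accuracy over k folds."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    accuracies = []
    for fold in (idx[i::k] for i in range(k)):
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        model = fit_stump([rows[i] for i in train], [labels[i] for i in train])
        hits = sum(model(rows[i]) == labels[i] for i in fold)
        accuracies.append(hits / len(fold))
    return sum(accuracies) / k

# Synthetic stand-in for expression data (NOT Gene Logic data):
# feature 0 is uninformative, feature 1 separates the two classes.
rows = [[i % 2, 0.10 + 0.02 * i] for i in range(10)] + \
       [[i % 2, 0.80 + 0.02 * i] for i in range(10)]
labels = ["normal"] * 10 + ["tumour"] * 10
accuracy = cross_validate(rows, labels, k=5)
print(accuracy)
```

On real expression data the same loop would be run over many more features, and a full tree would apply the split recursively to each branch; the stump is used here only to keep the example short.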
author Rahpeymai, Neda
title Data Mining with Decision Trees in the Gene Logic Database : A Breast Cancer Study
publisher Högskolan i Skövde, Institutionen för datavetenskap
publishDate 2002
url http://urn.kb.se/resolve?urn=urn:nbn:se:his:diva-710