An exploratory analysis of large health cohort study using Bayesian networks

Thesis (Ph. D.)--Harvard-MIT Division of Health Sciences and Technology, 2006. === Includes bibliographical references (p. 91-98). === Large health cohort studies are among the most effective ways of studying the causes, treatments, and outcomes of diseases by systematically collecting a wide range o...


Bibliographic Details
Main Author: Shen, Delin
Other Authors: Peter Szolovits.
Format: Others
Language: English
Published: Massachusetts Institute of Technology 2007
Subjects:
Online Access: http://dspace.mit.edu/handle/1721.1/34478
http://hdl.handle.net/1721.1/34478
id ndltd-MIT-oai-dspace.mit.edu-1721.1-34478
record_format oai_dc
spelling ndltd-MIT-oai-dspace.mit.edu-1721.1-34478 2019-05-02T16:06:51Z An exploratory analysis of large health cohort study using Bayesian networks Shen, Delin Peter Szolovits. Harvard University--MIT Division of Health Sciences and Technology. Harvard University--MIT Division of Health Sciences and Technology. Harvard University--MIT Division of Health Sciences and Technology. Thesis (Ph. D.)--Harvard-MIT Division of Health Sciences and Technology, 2006. Includes bibliographical references (p. 91-98). Large health cohort studies are among the most effective ways of studying the causes, treatments, and outcomes of diseases by systematically collecting a wide range of data over long periods. The wealth of data in such studies may yield important results in addition to the already numerous findings, especially when subjected to newer analytical methods. Bayesian Networks (BN) provide a relatively new method of representing uncertain relationships among variables, using the tools of probability and graph theory, and have been widely used in analyzing dependencies and the interplay between variables. We used BN to perform an exploratory analysis on a rich collection of data from one large health cohort study, the Nurses' Health Study (NHS), with a focus on breast cancer. We explored the data from the NHS using BN to look for breast cancer risk factors, including a group of Single Nucleotide Polymorphisms (SNPs). We found no association between the SNPs and breast cancer, but found a dependency between clomid and breast cancer. We evaluated clomid as a potential risk factor after matching on age and number of children. Our results showed that clomid was associated with an increased risk of estrogen receptor positive breast cancer (odds ratio 1.52, 95% CI 1.11-2.09) and a decreased risk of estrogen receptor negative breast cancer (odds ratio 0.46, 95% CI 0.22-0.97). We developed breast cancer risk models using BN.
We trained models on 75% of the data and evaluated them on the remaining 25%. Because of the clinical importance of predicting risks for Estrogen Receptor positive and Progesterone Receptor positive breast cancer, we focused on this specific type of breast cancer to predict two-year, four-year, and six-year risks. The concordance statistics of the prediction results on test sets are 0.70 (95% CI: 0.67-0.74), 0.68 (95% CI: 0.64-0.72), and 0.66 (95% CI: 0.62-0.69) for the two-, four-, and six-year models, respectively. We also evaluated the calibration performance of the models, and used Agglomerative Information Bottleneck clustering to filter the output, improving the linear relationship between predicted and observed risks without sacrificing much discrimination performance. by Delin Shen. Ph.D. 2007-10-22T16:24:19Z 2007-10-22T16:24:19Z 2006 2006 Thesis http://dspace.mit.edu/handle/1721.1/34478 http://hdl.handle.net/1721.1/34478 70784336 eng M.I.T. theses are protected by copyright. They may be viewed from this source for any purpose, but reproduction or distribution in any format is prohibited without written permission. See provided URL for inquiries about permission. http://dspace.mit.edu/handle/1721.1/34478 http://dspace.mit.edu/handle/1721.1/7582 98 p. application/pdf Massachusetts Institute of Technology
collection NDLTD
language English
format Others
sources NDLTD
topic Harvard University--MIT Division of Health Sciences and Technology.
description Thesis (Ph. D.)--Harvard-MIT Division of Health Sciences and Technology, 2006. === Includes bibliographical references (p. 91-98). === Large health cohort studies are among the most effective ways of studying the causes, treatments, and outcomes of diseases by systematically collecting a wide range of data over long periods. The wealth of data in such studies may yield important results in addition to the already numerous findings, especially when subjected to newer analytical methods. Bayesian Networks (BN) provide a relatively new method of representing uncertain relationships among variables, using the tools of probability and graph theory, and have been widely used in analyzing dependencies and the interplay between variables. We used BN to perform an exploratory analysis on a rich collection of data from one large health cohort study, the Nurses' Health Study (NHS), with a focus on breast cancer. We explored the data from the NHS using BN to look for breast cancer risk factors, including a group of Single Nucleotide Polymorphisms (SNPs). We found no association between the SNPs and breast cancer, but found a dependency between clomid and breast cancer. We evaluated clomid as a potential risk factor after matching on age and number of children. Our results showed that clomid was associated with an increased risk of estrogen receptor positive breast cancer (odds ratio 1.52, 95% CI 1.11-2.09) and a decreased risk of estrogen receptor negative breast cancer (odds ratio 0.46, 95% CI 0.22-0.97). === We developed breast cancer risk models using BN. We trained models on 75% of the data and evaluated them on the remaining 25%. Because of the clinical importance of predicting risks for Estrogen Receptor positive and Progesterone Receptor positive breast cancer, we focused on this specific type of breast cancer to predict two-year, four-year, and six-year risks.
The concordance statistics of the prediction results on test sets are 0.70 (95% CI: 0.67-0.74), 0.68 (95% CI: 0.64-0.72), and 0.66 (95% CI: 0.62-0.69) for the two-, four-, and six-year models, respectively. We also evaluated the calibration performance of the models, and used Agglomerative Information Bottleneck clustering to filter the output, improving the linear relationship between predicted and observed risks without sacrificing much discrimination performance. === by Delin Shen. === Ph.D.
author2 Peter Szolovits.
author_facet Peter Szolovits.
Shen, Delin
author Shen, Delin
author_sort Shen, Delin
title An exploratory analysis of large health cohort study using Bayesian networks
publisher Massachusetts Institute of Technology
publishDate 2007
url http://dspace.mit.edu/handle/1721.1/34478
http://hdl.handle.net/1721.1/34478