A Perturbative Decision Making Framework for Distributed Sensitive Data
Computer and Information Science === Ph.D. === In various business domains, intelligence garnered from data owned by peer institutions can provide useful information. But, due to regulations, privacy concerns and legal ramifications, peer institutions are reluctant to share raw data. For example, in...
Main Author: | Mathew, George |
---|---|
Other Authors: | Obradovic, Zoran |
Format: | Others |
Language: | English |
Published: | Temple University Libraries 2014 |
Subjects: | Computer science |
Online Access: | http://cdm16002.contentdm.oclc.org/cdm/ref/collection/p245801coll10/id/269109 |
id |
ndltd-TEMPLE-oai-cdm16002.contentdm.oclc.org-p245801coll10-269109 |
---|---|
record_format |
oai_dc |
collection |
NDLTD |
language |
English |
format |
Others |
sources |
NDLTD |
topic |
Computer science; |
spellingShingle |
Computer science; Mathew, George A Perturbative Decision Making Framework for Distributed Sensitive Data |
description |
Computer and Information Science === Ph.D. === In various business domains, intelligence garnered from data owned by peer institutions can provide useful information. But, due to regulations, privacy concerns and legal ramifications, peer institutions are reluctant to share raw data. For example, in the medical domain, HIPAA regulations, Personally Identifiable Information and privacy issues are impediments to data sharing. However, intelligence can be learned from distributed data sets if their key characteristics are shared among the desired parties. In scenarios where samples are rare locally but adequately available collectively from other sites, sharing key statistics about the data may be sufficient to make proper decisions. The objective of this research is to provide a framework in a distributed environment that helps decision-making using statistics of data from participating sites, thereby eliminating the need for raw data to be shared outside the institution. Distributed ID3-based Decision Tree (DIDT) model building is proposed for effectively building a Decision Support System based on labeled data from distributed sites. The framework includes a query mechanism, a global schema generation process brokered by a clearing-house (CH), generation of crosstable matrices by participating sites, and entropy calculation (for test) by the CH using aggregate information from the crosstable matrices. Empirical evaluations were done using synthetic and real data sets. Due to local data policies, participating sites may place restrictions on attribute release. The concept of "constraint graphs" is introduced as an out-of-band, high-level filtering mechanism for data in transit. Constraint graphs can be used to implement various data transformations, including attribute exclusions. Experiments conducted using constraint graphs yielded results consistent with baseline results. 
In the case of medical data, it was shown that communication costs for DIDT can be contained by auto-reduction of features, using predefined thresholds for near-constant attributes. In another study, it was shown that hospitals with insufficient data to build local prediction models were able to collaboratively build a common prediction model with better accuracy using DIDT. This prediction model also reduced the number of incorrectly classified patients. A natural follow-up question is: can a hospital with a sufficiently large number of instances provide a prediction model to a hospital with insufficient data? This was investigated, and the signature of a large hospital dataset that can provide such a model is presented. It is also shown that the error rate of such a model is not statistically significantly different from that of the collaboratively built model. When rare instances of data occur in a local database, it is quite valuable to draw conclusions collectively from such occurrences at other sites. However, in such situations, there will be a huge imbalance in classes among the relevant base population. We present a system that can collectively build a distributed classification model without the need for raw data from each site in the case of imbalanced data. The system uses a voting ensemble of experts for the decision model, where each expert is built using DIDT on selective data generated by oversampling the minority class and undersampling the majority class. The system can detect the imbalance condition and determine the number of experts needed for the ensemble. === Temple University--Theses |
author2 |
Obradovic, Zoran; |
author_facet |
Obradovic, Zoran; Mathew, George |
author |
Mathew, George |
author_sort |
Mathew, George |
title |
A Perturbative Decision Making Framework for Distributed Sensitive Data |
title_sort |
perturbative decision making framework for distributed sensitive data |
publisher |
Temple University Libraries |
publishDate |
2014 |
url |
http://cdm16002.contentdm.oclc.org/cdm/ref/collection/p245801coll10/id/269109 |
work_keys_str_mv |
AT mathewgeorge aperturbativedecisionmakingframeworkfordistributedsensitivedata AT mathewgeorge perturbativedecisionmakingframeworkfordistributedsensitivedata |
_version_ |
1718452148900986880 |
spelling |
ndltd-TEMPLE-oai-cdm16002.contentdm.oclc.org-p245801coll10-269109 2017-05-24T14:34:18Z Mathew, George A Perturbative Decision Making Framework for Distributed Sensitive Data 2014 Computer and Information Science Ph.D. 
Obradovic, Zoran; Yates, Alexander P.; Dragut, Eduard Constantin; Davey, Adam; Computer science; Temple University Libraries Dissertations Application/PDF 100 English TETDEDXMathew-temple-0225E-11698 The author has granted Temple University a limited, non-exclusive, royalty-free license to reproduce his or her dissertation, in whole or in part, in electronic or paper form and to make it available to the general public at no charge. This permission is granted in addition to rights granted to ProQuest. The author retains all other rights. Temple University--Theses 2061043 Bytes http://cdm16002.contentdm.oclc.org/cdm/ref/collection/p245801coll10/id/269109 |
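
The abstract describes participating sites generating crosstable matrices and the clearing-house (CH) computing entropy from their aggregate. The following is only a minimal sketch of that idea, not the dissertation's implementation: it assumes a crosstable is a count of (attribute value, class label) pairs, and every function name, attribute name, and data record below is hypothetical. It shows that an ID3-style information gain can be computed from summed counts alone, so no raw rows leave a site.

```python
from collections import Counter
from math import log2


def local_crosstab(rows, attribute, label):
    """Run at a site: count (attribute value, class label) pairs.
    Only these counts are released, never the rows themselves."""
    return Counter((row[attribute], row[label]) for row in rows)


def merge_crosstabs(crosstabs):
    """Run at the clearing-house: add up the per-site count tables."""
    total = Counter()
    for crosstab in crosstabs:
        total.update(crosstab)
    return total


def entropy(counts):
    """Shannon entropy (in bits) of a collection of class counts."""
    counts = [c for c in counts if c > 0]
    n = sum(counts)
    if n == 0:
        return 0.0
    return -sum(c / n * log2(c / n) for c in counts)


def information_gain(crosstab):
    """ID3 information gain of one attribute, from aggregate counts only."""
    class_totals = Counter()
    per_value = {}
    for (value, cls), count in crosstab.items():
        class_totals[cls] += count
        per_value.setdefault(value, Counter())[cls] += count
    n = sum(class_totals.values())
    # Weighted entropy of the class label within each attribute value.
    remainder = sum(
        sum(cls_counts.values()) / n * entropy(cls_counts.values())
        for cls_counts in per_value.values()
    )
    return entropy(class_totals.values()) - remainder


def best_attribute(sites, attributes, label):
    """Pick the attribute with the highest gain over all sites combined."""
    gains = {
        attr: information_gain(
            merge_crosstabs(local_crosstab(rows, attr, label) for rows in sites)
        )
        for attr in attributes
    }
    return max(gains, key=gains.get), gains


# Hypothetical toy data: two sites, each too small to learn from alone.
site_a = [
    {"fever": "yes", "cough": "yes", "flu": "pos"},
    {"fever": "no", "cough": "yes", "flu": "neg"},
]
site_b = [
    {"fever": "yes", "cough": "no", "flu": "pos"},
    {"fever": "no", "cough": "no", "flu": "neg"},
]
best, gains = best_attribute([site_a, site_b], ["fever", "cough"], "flu")
print(best, gains)
```

Because the counts are additive, the gain computed at the clearing-house equals the gain that would be obtained from the pooled raw data, which is what lets the sites keep their records local. A full tree builder would repeat this selection recursively on each branch, which the sketch omits.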