An investigation of data privacy and utility using machine learning as a gauge

The purpose of this investigation is to study and pursue a user-defined approach in preserving data privacy while maintaining an acceptable level of data utility using machine learning classification techniques as a gauge in the generation of synthetic data sets. This dissertation will dea...


Bibliographic Details
Main Author:	Mivule, Kato
Language:	EN
Published:	Bowie State University 2014
Subjects:	Information Technology | Information Science | Computer Science
Online Access:	http://pqdtopen.proquest.com/#viewpdf?dispub=3619387

id	ndltd-PROQUEST-oai-pqdtoai.proquest.com-3619387
record_format	oai_dc
spelling	ndltd-PROQUEST-oai-pqdtoai.proquest.com-3619387 2014-06-19T04:11:33Z An investigation of data privacy and utility using machine learning as a gauge Mivule, Kato Information Technology | Information Science | Computer Science Bowie State University 2014-06-18 00:00:00.0 thesis http://pqdtopen.proquest.com/#viewpdf?dispub=3619387 EN
collection	NDLTD
language	EN
sources	NDLTD
topic	Information Technology | Information Science | Computer Science
description	<p>The purpose of this investigation is to study and pursue a user-defined approach to preserving data privacy while maintaining an acceptable level of data utility, using machine learning classification techniques as a gauge in the generation of synthetic data sets. This dissertation deals with data privacy, data utility, machine learning classification, and the generation of synthetic data sets; data privacy and utility preservation using machine learning classification as a gauge is the central focus of the study. Many organizations that transact in large amounts of data must comply with state, federal, and international laws to guarantee that the privacy of individuals and other sensitive data is not compromised. Yet at some point during the data privacy process, data loses its utility, a measure of how useful a privatized data set is to the user of that data set. Data privacy researchers have documented that attaining an optimal balance between data privacy and utility is an NP-hard challenge, and thus an intractable problem. We therefore propose the classification error gauge (x-CEG) approach, a data utility quantification concept that employs machine learning classification techniques to gauge data utility based on the classification error. In the initial phase of the proposed approach, a data privacy algorithm such as differential privacy, Gaussian noise addition, generalization, and/or k-anonymity is applied to a data set for confidentiality, generating a privatized synthetic data set. The privatized synthetic data set is then passed through a machine learning classifier, after which the classification error is measured. If the classification error is lower than or equal to a set threshold, then better utility might be achieved; otherwise, the data privacy parameters are adjusted and the refined synthetic data set is sent to the machine learning classifier again. The process repeats until the error threshold is reached.
Additionally, this study presents the Comparative x-CEG concept, in which a privatized synthetic data set is passed through a series of classifiers, each of which returns a classification error; after parameter adjustments, the classifier with the lowest classification error is chosen, an indication of better data utility. Preliminary results from this investigation show that fine-tuning parameters in data privacy procedures (for example, in differential privacy) and increasing the number of weak learners in an ensemble classifier might lead to lower classification error, and thus better utility. Furthermore, this study explores the application of this approach by employing signal processing techniques in the generation of privatized synthetic data sets and in improving data utility. This dissertation presents theoretical and empirical work examining various data privacy and utility methodologies using machine learning classification as a gauge. This study also presents a resourceful approach to the generation of privatized synthetic data sets, and an innovative conceptual framework for the data privacy engineering process.</p>
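The x-CEG loop described in the abstract (privatize, classify, measure the error, adjust the privacy parameter, repeat) can be illustrated with a minimal sketch. This is not the dissertation's implementation: the toy two-cluster data set, the nearest-centroid classifier, the halving schedule for the noise level, and every name below (`make_dataset`, `privatize`, `classification_error`, `x_ceg`) are assumptions chosen only to keep the example self-contained. Gaussian noise addition stands in for the privacy algorithm because the abstract names it as one option.

```python
import random
from statistics import mean


def make_dataset(n_per_class, rng):
    """Toy data (assumed): two 2-D Gaussian clusters around (0, 0) and (4, 4)."""
    data = []
    for label, center in ((0, 0.0), (1, 4.0)):
        for _ in range(n_per_class):
            feats = (rng.gauss(center, 1.0), rng.gauss(center, 1.0))
            data.append((feats, label))
    rng.shuffle(data)
    return data


def privatize(data, sigma, rng):
    """Gaussian noise addition: perturb every feature, leave labels untouched."""
    return [(tuple(x + rng.gauss(0.0, sigma) for x in feats), label)
            for feats, label in data]


def classification_error(data):
    """Fit a nearest-centroid classifier on the first half, test on the second."""
    half = len(data) // 2
    train, test = data[:half], data[half:]
    centroids = {}
    for label in (0, 1):
        points = [feats for feats, y in train if y == label]
        centroids[label] = tuple(mean(axis) for axis in zip(*points))

    def predict(feats):
        # Pick the label whose centroid is closest in squared Euclidean distance.
        return min(centroids, key=lambda c: sum(
            (a - b) ** 2 for a, b in zip(feats, centroids[c])))

    wrong = sum(predict(feats) != y for feats, y in test)
    return wrong / len(test)


def x_ceg(data, threshold=0.10, sigma=8.0, decay=0.5, max_rounds=10, seed=0):
    """Repeat privatize -> classify -> measure until the error meets the threshold.

    The privacy parameter here is the noise level sigma; halving it each round
    is an assumed adjustment rule, not one taken from the dissertation.
    """
    rng = random.Random(seed)
    for round_no in range(1, max_rounds + 1):
        error = classification_error(privatize(data, sigma, rng))
        if error <= threshold or round_no == max_rounds:
            return sigma, error, round_no
        sigma *= decay  # weaker noise: less privacy, more utility


data = make_dataset(200, random.Random(42))
sigma, error, rounds = x_ceg(data)
print(f"stopped after {rounds} round(s): sigma={sigma}, error={error:.3f}")
```

The sketch makes the trade-off visible: each adjustment that lowers the classification error also weakens the perturbation, so the threshold effectively encodes how much utility the user demands before accepting a given privacy level.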
author	Mivule, Kato
title	An investigation of data privacy and utility using machine learning as a gauge
publisher	Bowie State University
publishDate	2014
url	http://pqdtopen.proquest.com/#viewpdf?dispub=3619387


Similar Items