Data De-Duplication through Active Learning

Data de-duplication concerns the identification and eventual elimination of records in a dataset that refer to the same entity without necessarily sharing the same attribute values or identifying values. Machine learning techniques, including Active Learning with ensemble methods, have been used to handle data de-duplication; see the full description below.

Bibliographic Details
Main Author: Muhivuwomunda, Divine
Format: Thesis
Language: English
Published: University of Ottawa (Canada), 2013
Physical Description: 99 p.
Subjects: Computer Science
Source: Masters Abstracts International, Volume 49-06, page 3893
Online Access: http://hdl.handle.net/10393/28859
http://dx.doi.org/10.20381/ruor-19478

Full Description

Data de-duplication concerns the identification and eventual elimination of records in a dataset that refer to the same entity without necessarily sharing the same attribute values or identifying values. Machine learning techniques have been used to handle data de-duplication, and Active Learning using ensemble learning methods is one such technique. An ensemble learning algorithm is used to create, from the same training set, a set of diverse models. Active Learning then iteratively passes unlabeled pairs of records to these models for labeling as duplicates or non-duplicates, and selects the pairs that cause the most disagreement among the models. The selected pairs are considered to bring the most information gain to the learning process. Active Learning thus continuously teaches a learner to find duplicate instances by providing it with a better training set.

This thesis evaluates how Active Learning undertakes the task of data de-duplication when the Query by Bagging and Query by Boosting algorithms are used. We investigate the performance of Active Learning in various situations: the impact of varying the dataset size, the impact of using different blocking methods (methods that reduce the number of potential duplicates to compare), and the performance on a synthetic dataset versus a real-world dataset.

The experimental results show that Active Learning using Query by Bagging performs well on synthetic datasets and requires only a few iterations to generate a good de-duplication function. The size of the dataset does not seem to have much effect on the results. On real-world data, Active Learning using Query by Bagging still performs well, except when the dataset contains a significant amount of noise, and the learning process is less smooth than on synthetic data. The Canopy Clustering and Bigram Indexing blocking methods were both evaluated, with Bigram Indexing giving the better results.

Active Learning using Query by Boosting performs well on both synthetic and real-world datasets, although noise in the dataset negatively affects the learning process. Again, the dataset size does not affect performance. Evaluating the de-duplication function with Canopy Clustering versus Bigram Indexing shows no significant difference.

Finally, we compare Query by Bagging against Query by Boosting. When the two methods are compared under the two blocking methods, Query by Boosting yields better results for both Canopy Clustering and Bigram Indexing; the same observation holds when comparing synthetic and real-world data.
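
To make the committee-based selection step concrete, the following is a minimal sketch of Query by Bagging for choosing which candidate record pairs to send to a human labeler, not the thesis's actual implementation. It assumes Python with scikit-learn and NumPy, and that each candidate pair has already been encoded as a vector of attribute-similarity features; the function name, the vote-entropy disagreement score, and the default decision-tree committee members are illustrative assumptions.

    # Sketch only: Query by Bagging selection of record pairs to label.
    # Assumes each candidate pair is a vector of attribute-similarity
    # features, with labels 0 (non-duplicate) / 1 (duplicate).
    import numpy as np
    from sklearn.ensemble import BaggingClassifier

    def select_most_disputed(labeled_X, labeled_y, unlabeled_X,
                             n_queries=10, n_models=10):
        # Bagging trains n_models decision trees on bootstrap samples of
        # the same labeled training set, yielding a diverse committee.
        committee = BaggingClassifier(n_estimators=n_models, random_state=0)
        committee.fit(labeled_X, labeled_y)

        # Each committee member votes on every unlabeled pair.
        votes = np.stack([m.predict(unlabeled_X)
                          for m in committee.estimators_])

        # Vote entropy as the disagreement score: largest when the
        # committee is split evenly between duplicate and non-duplicate.
        p = votes.mean(axis=0)
        eps = 1e-12
        entropy = -(p * np.log(p + eps) + (1 - p) * np.log(1 - p + eps))

        # Indices of the pairs to label next; their labels join the
        # training set before the next Active Learning iteration.
        return np.argsort(entropy)[::-1][:n_queries]

Query by Boosting follows the same selection loop, but the committee is built with a boosting algorithm instead of bagging.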
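
The blocking step mentioned in the abstract can be sketched in the same spirit. Below is a simplified, illustrative version of Bigram Indexing-style blocking: each record's blocking-key value is split into bigrams, sorted sub-lists of those bigrams (with length controlled by a threshold) become index keys, and only records sharing a key are compared as candidate pairs. The field name, threshold, and toy records are assumptions for illustration, not the exact configuration evaluated in the thesis.

    # Sketch only: Bigram Indexing as a blocking method.
    # Records that share a bigram sub-list key fall into the same block,
    # so only those pairs are compared during de-duplication.
    from collections import defaultdict
    from itertools import combinations
    from math import ceil

    def bigrams(value):
        value = value.lower().replace(" ", "")
        return sorted({value[i:i + 2] for i in range(len(value) - 1)})

    def bigram_index(records, field, threshold=0.5):
        # Each sub-list of ceil(len(bigrams) * threshold) bigrams is a key,
        # so the threshold controls how much overlap two values must share.
        index = defaultdict(set)
        for rec_id, rec in records.items():
            grams = bigrams(rec[field])
            k = max(1, ceil(len(grams) * threshold))
            for sub in combinations(grams, k):
                index["".join(sub)].add(rec_id)
        return index

    def candidate_pairs(index):
        pairs = set()
        for block in index.values():
            for a, b in combinations(sorted(block), 2):
                pairs.add((a, b))
        return pairs

    # Toy usage with hypothetical records: "Smith" and "Smyth" share the
    # key "smth", so (1, 2) is the only candidate pair; "Jones" is blocked out.
    records = {1: {"surname": "Smith"},
               2: {"surname": "Smyth"},
               3: {"surname": "Jones"}}
    print(candidate_pairs(bigram_index(records, "surname")))

Canopy Clustering, the other blocking method evaluated in the thesis, instead groups records into overlapping canopies using a cheap distance measure with a loose and a tight threshold.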