Data De-Duplication through Active Learning

Data de-duplication concerns the identification and eventual elimination of records in a dataset that refer to the same entity without necessarily sharing the same attribute values or identifying values. Machine learning techniques, including Active Learning with ensemble methods, have been used to handle data de-duplication; see the full description below.

Bibliographic Details
Main Author: Muhivuwomunda, Divine
Format: Thesis
Language: English
Published: University of Ottawa (Canada), 2013
Physical Description: 99 p.
Subjects: Computer Science
Source: Masters Abstracts International, Volume 49-06, page 3893
Online Access: http://hdl.handle.net/10393/28859
http://dx.doi.org/10.20381/ruor-19478

Full Description

Data de-duplication concerns the identification and eventual elimination of records in a dataset that refer to the same entity without necessarily sharing the same attribute values or identifying values. Machine learning techniques have been used to handle data de-duplication, and Active Learning using ensemble learning methods is one such technique. An ensemble learning algorithm is used to create, from the same training set, a set of diverse models. Active Learning then iteratively passes unlabeled pairs of records to these models for labeling as duplicates or non-duplicates, and selects the pairs that cause the most disagreement among the models. The selected pairs are considered to bring the most information gain to the learning process. Active Learning thus continuously teaches a learner to find duplicate instances by providing it with a better training set.

This thesis evaluates how Active Learning undertakes the task of data de-duplication when the Query by Bagging and Query by Boosting algorithms are used. We investigate the performance of Active Learning in various situations: the impact of varying the dataset size, the impact of using different blocking methods (methods that reduce the number of potential duplicates to compare), and the performance on a synthetic dataset versus a real-world dataset.

The experimental results show that Active Learning using Query by Bagging performs well on synthetic datasets and requires only a few iterations to generate a good de-duplication function. The size of the dataset does not seem to have much effect on the results. On real-world data, Active Learning using Query by Bagging still performs well, except when the dataset contains a significant amount of noise, and the learning process is less smooth than on synthetic data. The Canopy Clustering and Bigram Indexing blocking methods were both evaluated, with Bigram Indexing giving the better results.

Active Learning using Query by Boosting performs well on both synthetic and real-world datasets, although noise in the dataset negatively affects the learning process. Again, the dataset size does not affect performance. Evaluating the de-duplication function with Canopy Clustering versus Bigram Indexing shows no significant difference.

Finally, we compare Query by Bagging against Query by Boosting. When the two methods are compared under the two blocking methods, Query by Boosting yields better results for both Canopy Clustering and Bigram Indexing; the same observation holds when comparing synthetic and real-world data.
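
To make the committee-based selection step concrete, the following is a minimal sketch of Query by Bagging for choosing which candidate record pairs to send to a human labeler, not the thesis's actual implementation. It assumes Python with scikit-learn and NumPy, and that each candidate pair has already been encoded as a vector of attribute-similarity features; the function name, the vote-entropy disagreement score, and the default decision-tree committee members are illustrative assumptions.

    # Sketch only: Query by Bagging selection of record pairs to label.
    # Assumes each candidate pair is a vector of attribute-similarity
    # features, with labels 0 (non-duplicate) / 1 (duplicate).
    import numpy as np
    from sklearn.ensemble import BaggingClassifier

    def select_most_disputed(labeled_X, labeled_y, unlabeled_X,
                             n_queries=10, n_models=10):
        # Bagging trains n_models decision trees on bootstrap samples of
        # the same labeled training set, yielding a diverse committee.
        committee = BaggingClassifier(n_estimators=n_models, random_state=0)
        committee.fit(labeled_X, labeled_y)

        # Each committee member votes on every unlabeled pair.
        votes = np.stack([m.predict(unlabeled_X)
                          for m in committee.estimators_])

        # Vote entropy as the disagreement score: largest when the
        # committee is split evenly between duplicate and non-duplicate.
        p = votes.mean(axis=0)
        eps = 1e-12
        entropy = -(p * np.log(p + eps) + (1 - p) * np.log(1 - p + eps))

        # Indices of the pairs to label next; their labels join the
        # training set before the next Active Learning iteration.
        return np.argsort(entropy)[::-1][:n_queries]

Query by Boosting follows the same selection loop, but the committee is built with a boosting algorithm instead of bagging.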
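
The blocking step mentioned in the abstract can be sketched in the same spirit. Below is a simplified, illustrative version of Bigram Indexing-style blocking: each record's blocking-key value is split into bigrams, sorted sub-lists of those bigrams (with length controlled by a threshold) become index keys, and only records sharing a key are compared as candidate pairs. The field name, threshold, and toy records are assumptions for illustration, not the exact configuration evaluated in the thesis.

    # Sketch only: Bigram Indexing as a blocking method.
    # Records that share a bigram sub-list key fall into the same block,
    # so only those pairs are compared during de-duplication.
    from collections import defaultdict
    from itertools import combinations
    from math import ceil

    def bigrams(value):
        value = value.lower().replace(" ", "")
        return sorted({value[i:i + 2] for i in range(len(value) - 1)})

    def bigram_index(records, field, threshold=0.5):
        # Each sub-list of ceil(len(bigrams) * threshold) bigrams is a key,
        # so the threshold controls how much overlap two values must share.
        index = defaultdict(set)
        for rec_id, rec in records.items():
            grams = bigrams(rec[field])
            k = max(1, ceil(len(grams) * threshold))
            for sub in combinations(grams, k):
                index["".join(sub)].add(rec_id)
        return index

    def candidate_pairs(index):
        pairs = set()
        for block in index.values():
            for a, b in combinations(sorted(block), 2):
                pairs.add((a, b))
        return pairs

    # Toy usage with hypothetical records: "Smith" and "Smyth" share the
    # key "smth", so (1, 2) is the only candidate pair; "Jones" is blocked out.
    records = {1: {"surname": "Smith"},
               2: {"surname": "Smyth"},
               3: {"surname": "Jones"}}
    print(candidate_pairs(bigram_index(records, "surname")))

Canopy Clustering, the other blocking method evaluated in the thesis, instead groups records into overlapping canopies using a cheap distance measure with a loose and a tight threshold.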