Learning-Based Fusion for Data Deduplication: A Robust and Automated Solution

This thesis presents two deduplication techniques that overcome the following critical and long-standing weaknesses of rule-based deduplication: (1) traditional rule-based deduplication requires significant manual tuning of the individual rules, including the selection of appropriate thresholds; (2)...

Bibliographic Details
Main Author: Dinerstein, Jared
Format: Others
Published: DigitalCommons@USU 2010
Subjects:
svm
Online Access: https://digitalcommons.usu.edu/etd/787
https://digitalcommons.usu.edu/cgi/viewcontent.cgi?article=1783&context=etd
id ndltd-UTAHS-oai-digitalcommons.usu.edu-etd-1783
record_format oai_dc
spelling ndltd-UTAHS-oai-digitalcommons.usu.edu-etd-1783
2019-10-13T06:10:57Z
2010-12-01T08:00:00Z
text
application/pdf
Copyright for this work is held by the author. Transmission or reproduction of materials protected by copyright beyond that allowed by fair use requires the written permission of the copyright owners. Works not in the public domain cannot be commercially exploited without permission of the copyright owner. Responsibility for any use rests exclusively with the user. For more information contact Andrew Wesolek (andrew.wesolek@usu.edu).
All Graduate Theses and Dissertations
DigitalCommons@USU
collection NDLTD
format Others
sources NDLTD
topic active learning
deduplication
sensitivity
specificity
support vector machine
svm
Computer Sciences
description This thesis presents two deduplication techniques that overcome the following critical and long-standing weaknesses of rule-based deduplication: (1) traditional rule-based deduplication requires significant manual tuning of the individual rules, including the selection of appropriate thresholds; (2) the accuracy of rule-based deduplication degrades when there are missing data values, significantly reducing the efficacy of the expert-defined deduplication rules. The first technique is a novel rule-level match-score fusion algorithm that employs kernel-machine-based learning to discover the decision threshold for the overall system automatically. The second is a novel clue-level match-score fusion algorithm that addresses both problems (1) and (2). This solution provides robustness against missing or incomplete record data via the selection of a best-fit support vector machine. Empirical evidence shows that the combination of these two novel solutions eliminates two critical long-standing problems in deduplication, providing accurate and robust results in a critical area of rule-based deduplication.
author Dinerstein, Jared
title Learning-Based Fusion for Data Deduplication: A Robust and Automated Solution
publisher DigitalCommons@USU
publishDate 2010
url https://digitalcommons.usu.edu/etd/787
https://digitalcommons.usu.edu/cgi/viewcontent.cgi?article=1783&context=etd