Learning-Based Fusion for Data Deduplication: A Robust and Automated Solution

Bibliographic Details
Main Author:	Dinerstein, Jared
Format:	Others
Published:	DigitalCommons@USU 2010
Subjects:	active learning; deduplication; sensitivity; specificity; support vector machine; svm; Computer Sciences
Online Access:	https://digitalcommons.usu.edu/etd/787 https://digitalcommons.usu.edu/cgi/viewcontent.cgi?article=1783&context=etd

Description
Summary:	This thesis presents two deduplication techniques that overcome two critical, long-standing weaknesses of rule-based deduplication: (1) traditional rule-based deduplication requires significant manual tuning of the individual rules, including the selection of appropriate thresholds; (2) its accuracy degrades when data values are missing, which significantly reduces the efficacy of the expert-defined deduplication rules. The first technique is a novel rule-level match-score fusion algorithm that employs kernel-machine-based learning to discover the decision threshold for the overall system automatically. The second is a novel clue-level match-score fusion algorithm that addresses both problems (1) and (2); it provides robustness against missing or incomplete record data by selecting a best-fit support vector machine. Empirical evidence shows that the combination of these two solutions eliminates both problems, providing accurate and robust results for rule-based deduplication.
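
The two ideas in the summary, learning the decision boundary from match scores rather than hand-tuning thresholds and picking a model that fits the clues actually present, can be illustrated with a minimal sketch. It is not the thesis's implementation and rests on several assumptions: a linear SVM trained by batch subgradient descent stands in for the kernel machine; "best-fit" selection is read as "use the model trained on exactly the clues that are available," which the record does not confirm; and the function names (train_linear_svm, train_per_mask, is_duplicate), the three clues, and the training scores are all hypothetical.

```python
from itertools import combinations


def train_linear_svm(X, y, lam=0.01, eta=0.1, epochs=500):
    """Fit a linear soft-margin SVM by batch subgradient descent on the hinge loss.

    X is a list of score vectors, y holds labels in {+1, -1}.
    Returns (weights, bias). A linear kernel is an assumption of this sketch.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]  # gradient of the L2 regularizer
        grad_b = 0.0
        for xi, yi in zip(X, y):
            score = sum(wj * xj for wj, xj in zip(w, xi)) + b
            if yi * score < 1:  # inside the margin, so the hinge loss is active
                for j in range(d):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - eta * gj for wj, gj in zip(w, grad_w)]
        b -= eta * grad_b
    return w, b


def train_per_mask(X, y):
    """Train one SVM for every non-empty subset of clue indices (assumed scheme)."""
    d = len(X[0])
    models = {}
    for k in range(1, d + 1):
        for mask in combinations(range(d), k):
            projected = [[row[j] for j in mask] for row in X]
            models[mask] = train_linear_svm(projected, y)
    return models


def is_duplicate(models, scores):
    """Classify a record pair; None marks a clue whose underlying data is missing."""
    mask = tuple(j for j, s in enumerate(scores) if s is not None)
    if not mask:
        raise ValueError("no clue scores available")
    w, b = models[mask]  # select the model that fits the available clues
    return sum(wj * scores[j] for wj, j in zip(w, mask)) + b > 0


# Made-up (name, address, phone) similarity scores in [0, 1] for labeled record pairs.
X = [
    [0.90, 0.80, 0.95], [0.85, 0.90, 0.80], [0.95, 0.90, 0.90], [0.80, 0.85, 0.90],
    [0.10, 0.20, 0.15], [0.20, 0.10, 0.10], [0.15, 0.25, 0.20], [0.05, 0.10, 0.20],
]
y = [1, 1, 1, 1, -1, -1, -1, -1]  # +1 = duplicate pair, -1 = distinct records

models = train_per_mask(X, y)
```

A call such as is_duplicate(models, (0.9, None, 0.85)) skips the missing address clue and uses the model trained on name and phone scores only, so no placeholder value has to be invented for the missing field. Training one model per subset grows exponentially with the number of clues, so this layout only suits a small clue set.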
