Duplicate Detection with PMC -- A Parallel Approach to Pattern Matching

Fuzzy duplicate detection is an integral part of data cleansing. It consists of finding a set of duplicate records, correctly identifying the original or most representative record, and removing the rest. Internet usage and the availability and collectability of data are increasing, so we get mor...

Bibliographic Details
Main Author:	Leland, Robert
Format:	Others
Language:	English
Published:	Norges teknisk-naturvitenskapelige universitet, Institutt for datateknikk og informasjonsvitenskap 2007
Subjects:	ntnudaim SIF2 datateknikk Intelligente systemer
Online Access:	http://urn.kb.se/resolve?urn=urn:nbn:no:ntnu:diva-9642

id	ndltd-UPSALLA1-oai-DiVA.org-ntnu-9642
record_format	oai_dc
spelling	ndltd-UPSALLA1-oai-DiVA.org-ntnu-9642 | 2013-01-08T13:26:37Z | Student thesis | info:eu-repo/semantics/bachelorThesis | text | Local ntnudaim:3375 | application/pdf | info:eu-repo/semantics/openAccess
collection	NDLTD
language	English
format	Others
sources	NDLTD
topic	ntnudaim SIF2 datateknikk Intelligente systemer
description	Fuzzy duplicate detection is an integral part of data cleansing. It consists of finding a set of duplicate records, correctly identifying the original or most representative record, and removing the rest. Internet usage and the availability and collectability of data are increasing, so we have access to more and more data. Much of this data is collected from, and entered by, humans, which introduces noise in the form of typing mistakes, spelling discrepancies, varying schemas, abbreviations, and more. Because of this, data cleansing and approximate duplicate detection are now more important than ever. In fuzzy matching, records are usually compared by measuring the edit distance between two records. This causes problems with large data sets, where a large number of record comparisons must be made, so previous solutions have found ways to cut down on the number of records to be compared. This is often done by creating a key on which the records are sorted, with the intention of placing similar records near each other. This has several downsides; for example, potentially large amounts of data must be sorted and searched several times to catch duplicates accurately. This project differs in that it presents an approach to the problem that takes advantage of a multiple instruction stream, multiple data stream (MIMD) architecture called a Pattern Matching Chip (PMC), which allows large numbers of character comparisons in parallel. This makes it possible to do fuzzy matching against the entire data set very quickly, removing the need for clustering and rearranging of the data, which can often lead to omitted duplicates (false negatives). The main aim of this paper is to test the viability of this approach for duplicate detection by examining its performance, potential, and scalability.
author	Leland, Robert
title	Duplicate Detection with PMC -- A Parallel Approach to Pattern Matching
publisher	Norges teknisk-naturvitenskapelige universitet, Institutt for datateknikk og informasjonsvitenskap
publishDate	2007
url	http://urn.kb.se/resolve?urn=urn:nbn:no:ntnu:diva-9642
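
The description above contrasts comparing every record against every other by edit distance with sorting on a key and comparing only nearby records. The following is a minimal software sketch of that contrast, assuming classic Levenshtein distance as the edit distance and a fixed sliding window over the sorted records. The function names, the `threshold` and `window` parameters, the lower-cased sort key, and the sample records are all illustrative assumptions; they are not taken from the thesis, and the code does not model the PMC hardware.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def all_pairs_duplicates(records, threshold):
    """Compare every record with every other one: O(n^2) comparisons."""
    return {(i, j)
            for (i, r), (j, s) in combinations(enumerate(records), 2)
            if levenshtein(r, s) <= threshold}


def sorted_neighborhood_duplicates(records, threshold, window, key=str.lower):
    """Sort on a key, then compare only records inside a sliding window."""
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    found = set()
    for pos, i in enumerate(order):
        # A window of size w holds the current record plus the next w - 1.
        for j in order[pos + 1: pos + window]:
            if levenshtein(records[i], records[j]) <= threshold:
                found.add((min(i, j), max(i, j)))
    return found


# Hypothetical sample data: record 2 has lost its first character.
records = ["John Smith", "Jon Smith", "ohn Smith", "Mary Jones"]
print(sorted(all_pairs_duplicates(records, threshold=1)))
# [(0, 1), (0, 2)]
print(sorted(sorted_neighborhood_duplicates(records, threshold=1, window=2)))
# [(0, 1)]
```

In this toy data the typo in the first character sorts "ohn Smith" far from "John Smith", so the windowed pass misses that pair. This is the kind of false negative the description attributes to key-based sorting, and the all-pairs comparison that avoids it is the quadratic workload the thesis proposes to run in parallel.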