Study on Record Linkage regarding Accuracy and Scalability

Record linkage finds records that refer to the same entity across different data sources. It is also known as data matching, entity resolution, entity disambiguation, or deduplication. Record linkage is useful in many practices i...


Bibliographic Details
Main Author: Dannelöv, Johannes
Format: Others
Language: English
Published: Umeå universitet, Institutionen för datavetenskap 2018
Subjects: Engineering and Technology
Online Access: http://urn.kb.se/resolve?urn=urn:nbn:se:umu:diva-155357
id ndltd-UPSALLA1-oai-DiVA.org-umu-155357
record_format oai_dc
spelling ndltd-UPSALLA1-oai-DiVA.org-umu-155357
2019-01-15T06:23:12Z
Study on Record Linkage regarding Accuracy and Scalability
eng
Dannelöv, Johannes
Umeå universitet, Institutionen för datavetenskap
2018
Engineering and Technology
Teknik och teknologier
Student thesis
info:eu-repo/semantics/bachelorThesis
text
http://urn.kb.se/resolve?urn=urn:nbn:se:umu:diva-155357
UMNAD ; 1168
application/pdf
info:eu-repo/semantics/openAccess
collection NDLTD
language English
format Others
sources NDLTD
topic Engineering and Technology
Teknik och teknologier
description Record linkage finds records that refer to the same entity across different data sources. It is also known as data matching, entity resolution, entity disambiguation, or deduplication. Record linkage is useful in many practices, including data cleaning, data management, and business intelligence. Machine learning methods, both unsupervised and supervised, have been applied to the problem of record linkage. The rise of the big data era has presented new challenges, and the trade-off between accuracy and scalability raises several critical issues for the linkage process. The objective of this study is to present an overview of state-of-the-art machine learning algorithms for record linkage, to compare them, and to explore how these algorithms can be optimized using different similarity functions. The optimization is evaluated in terms of accuracy and scalability. Results showed that supervised classification algorithms, even with a relatively small training set, classified data sets in less time than their unsupervised counterparts and with approximately the same accuracy.
author Dannelöv, Johannes
title Study on Record Linkage regarding Accuracy and Scalability
publisher Umeå universitet, Institutionen för datavetenskap
publishDate 2018
url http://urn.kb.se/resolve?urn=urn:nbn:se:umu:diva-155357