Study on Record Linkage regarding Accuracy and Scalability
The idea of record linkage is to find records that refer to the same entity across different data sources. It goes by several other names, including data matching, entity resolution, entity disambiguation, and deduplication. Record linkage is useful for lots of practices i...
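As a rough illustration of the idea summarized above, the following Python sketch links two small record sets by averaging a string similarity score over shared fields and accepting pairs above a threshold. It is not the method used in the thesis: the `similarity` and `link_records` helpers, the toy records, and the 0.8 threshold are all assumptions made for this example, and the standard-library `difflib.SequenceMatcher` simply stands in for whichever similarity functions the study evaluates.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a case-insensitive similarity score in [0, 1] for two strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def link_records(source_a, source_b, fields, threshold=0.8):
    """Return (index_a, index_b, score) for every pair whose mean field
    similarity reaches the threshold (hypothetical helper, naive all-pairs)."""
    matches = []
    for i, rec_a in enumerate(source_a):
        for j, rec_b in enumerate(source_b):
            # Average the per-field similarity into one score per pair.
            score = sum(similarity(rec_a[f], rec_b[f]) for f in fields) / len(fields)
            if score >= threshold:
                matches.append((i, j, round(score, 3)))
    return matches


# Hypothetical toy data: two sources that describe overlapping people.
source_a = [
    {"name": "Anna Svensson", "city": "Umeå"},
    {"name": "Karl Berg", "city": "Stockholm"},
]
source_b = [
    {"name": "Ana Svensson", "city": "Umea"},
    {"name": "Lena Holm", "city": "Göteborg"},
]

matches = link_records(source_a, source_b, fields=["name", "city"])
print(matches)
```

Comparing every pair makes the work grow with the product of the two source sizes, which is one place where the trade-off between accuracy and scalability shows up.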
Main Author: | Dannelöv, Johannes |
---|---|
Format: | Others |
Language: | English |
Published: | Umeå universitet, Institutionen för datavetenskap, 2018 |
Subjects: | Engineering and Technology; Teknik och teknologier |
Online Access: | http://urn.kb.se/resolve?urn=urn:nbn:se:umu:diva-155357 |
id |
ndltd-UPSALLA1-oai-DiVA.org-umu-155357 |
---|---|
record_format |
oai_dc |
spelling |
ndltd-UPSALLA1-oai-DiVA.org-umu-155357; 2019-01-15T06:23:12Z; Study on Record Linkage regarding Accuracy and Scalability; eng; Dannelöv, Johannes; Umeå universitet, Institutionen för datavetenskap; 2018; Engineering and Technology; Teknik och teknologier; The idea of record linkage is to find records that refer to the same entity across different data sources. Record linkage goes by several other names, including data matching, entity resolution, entity disambiguation, and deduplication. It is useful in many practices, including data cleaning, data management, and business intelligence. Machine learning methods, both unsupervised and supervised, have been applied to the problem of record linkage. The rise of the big data era has presented new challenges, and the trade-off between accuracy and scalability raises several critical issues for the linkage process. The objective of this study is to present an overview of state-of-the-art machine learning algorithms for record linkage, to compare them, and to explore how these algorithms can be optimized using different similarity functions. The optimization is evaluated in terms of accuracy and scalability. Results showed that supervised classification algorithms, even with a relatively small training set, classified data sets in less time and with approximately the same accuracy as their unsupervised counterparts.; Student thesis; info:eu-repo/semantics/bachelorThesis; text; http://urn.kb.se/resolve?urn=urn:nbn:se:umu:diva-155357; UMNAD ; 1168; application/pdf; info:eu-repo/semantics/openAccess |
collection |
NDLTD |
language |
English |
format |
Others |
sources |
NDLTD |
topic |
Engineering and Technology; Teknik och teknologier |
spellingShingle |
Engineering and Technology; Teknik och teknologier; Dannelöv, Johannes; Study on Record Linkage regarding Accuracy and Scalability |
description |
The idea of record linkage is to find records that refer to the same entity across different data sources. Record linkage goes by several other names, including data matching, entity resolution, entity disambiguation, and deduplication. It is useful in many practices, including data cleaning, data management, and business intelligence. Machine learning methods, both unsupervised and supervised, have been applied to the problem of record linkage. The rise of the big data era has presented new challenges, and the trade-off between accuracy and scalability raises several critical issues for the linkage process. The objective of this study is to present an overview of state-of-the-art machine learning algorithms for record linkage, to compare them, and to explore how these algorithms can be optimized using different similarity functions. The optimization is evaluated in terms of accuracy and scalability. Results showed that supervised classification algorithms, even with a relatively small training set, classified data sets in less time and with approximately the same accuracy as their unsupervised counterparts. |
author |
Dannelöv, Johannes |
title |
Study on Record Linkage regarding Accuracy and Scalability |
publisher |
Umeå universitet, Institutionen för datavetenskap |
publishDate |
2018 |
url |
http://urn.kb.se/resolve?urn=urn:nbn:se:umu:diva-155357 |