Scalable Matching and Clustering of Entities with FAMER

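Entity resolution identifies semantically equivalent entities, e.g. describing the same product or customer. It is especially challenging for Big Data applications where large volumes of data from many sources have to be matched and integrated. We therefore introduce a scalable entity resolution framework called FAMER (FAst Multi-source Entity Resolution system) that is based on Apache Flink for distributed execution and that can holistically match entities from multiple sources. For the latter purpose, FAMER includes multiple clustering schemes that group matching entities from different sources within clusters. In addition to previously known clustering schemes FAMER includes new approaches tailored to multi-source entity resolution. We perform a detailed comparative evaluation of eight clustering schemes for different real-life and synthetically generated datasets. The evaluation considers both the match quality as well as the scalability for different numbers of machines and data sizes.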


Bibliographic Details
Main Authors: Alieh Saeedi, Markus Nentwig, Eric Peukert, Erhard Rahm
Format: Article
Language: English
Published: Riga Technical University 2018-10-01
Series: Complex Systems Informatics and Modeling Quarterly
Subjects: Clustering, Matching, Distributed processing, Multi-source
Online Access: https://csimq-journals.rtu.lv/article/view/2407
id doaj-01727c66ab9c4466bb036608c58b7172
record_format Article
spelling doaj-01727c66ab9c4466bb036608c58b7172
2020-11-25T01:29:38Z
eng
Riga Technical University
Complex Systems Informatics and Modeling Quarterly
2255-9922
2018-10-01
016618310.7250/csimq.2018-16.041236
Scalable Matching and Clustering of Entities with FAMER
Alieh Saeedi (0), Markus Nentwig (1), Eric Peukert (2), Erhard Rahm (3)
(0) Database Group, Department of Computer Science, University of Leipzig, Leipzig; Competence Center for Scalable Data Services and Solutions Dresden/Leipzig
(1) Database Group, Department of Computer Science, University of Leipzig, Leipzig
(2) Database Group, Department of Computer Science, University of Leipzig, Leipzig; Competence Center for Scalable Data Services and Solutions Dresden/Leipzig
(3) Database Group, Department of Computer Science, University of Leipzig, Leipzig; Competence Center for Scalable Data Services and Solutions Dresden/Leipzig
https://csimq-journals.rtu.lv/article/view/2407
Clustering; Matching; Distributed processing; Multi-source
collection DOAJ
language English
format Article
sources DOAJ
author Alieh Saeedi
Markus Nentwig
Eric Peukert
Erhard Rahm
title Scalable Matching and Clustering of Entities with FAMER
publisher Riga Technical University
series Complex Systems Informatics and Modeling Quarterly
issn 2255-9922
publishDate 2018-10-01
description Entity resolution identifies semantically equivalent entities, e.g. describing the same product or customer. It is especially challenging for Big Data applications where large volumes of data from many sources have to be matched and integrated. We therefore introduce a scalable entity resolution framework called FAMER (FAst Multi-source Entity Resolution system) that is based on Apache Flink for distributed execution and that can holistically match entities from multiple sources. For the latter purpose, FAMER includes multiple clustering schemes that group matching entities from different sources within clusters. In addition to previously known clustering schemes FAMER includes new approaches tailored to multi-source entity resolution. We perform a detailed comparative evaluation of eight clustering schemes for different real-life and synthetically generated datasets. The evaluation considers both the match quality as well as the scalability for different numbers of machines and data sizes.
topic Clustering
Matching
Distributed processing
Multi-source
url https://csimq-journals.rtu.lv/article/view/2407