Scalable Matching and Clustering of Entities with FAMER
Entity resolution identifies semantically equivalent entities, e.g. describing the same product or customer. It is especially challenging for Big Data applications where large volumes of data from many sources have to be matched and integrated. We therefore introduce a scalable entity resolution fra...
Main Authors: | Alieh Saeedi, Markus Nentwig, Eric Peukert, Erhard Rahm |
---|---|
Format: | Article |
Language: | English |
Published: | Riga Technical University, 2018-10-01 |
Series: | Complex Systems Informatics and Modeling Quarterly |
Subjects: | Clustering, Matching, Distributed processing, Multi-source |
Online Access: | https://csimq-journals.rtu.lv/article/view/2407 |
id | doaj-01727c66ab9c4466bb036608c58b7172 |
---|---|
record_format | Article |
spelling | doaj-01727c66ab9c4466bb036608c58b7172; 2020-11-25T01:29:38Z; eng; Riga Technical University; Complex Systems Informatics and Modeling Quarterly; 2255-9922; 2018-10-01; 0; 16; 61; 83; 10.7250/csimq.2018-16.04; 1236; Scalable Matching and Clustering of Entities with FAMER; Alieh Saeedi (Database Group, Department of Computer Science, University of Leipzig, Leipzig; Competence Center for Scalable Data Services and Solutions Dresden/Leipzig); Markus Nentwig (Database Group, Department of Computer Science, University of Leipzig, Leipzig); Eric Peukert (Database Group, Department of Computer Science, University of Leipzig, Leipzig; Competence Center for Scalable Data Services and Solutions Dresden/Leipzig); Erhard Rahm (Database Group, Department of Computer Science, University of Leipzig, Leipzig; Competence Center for Scalable Data Services and Solutions Dresden/Leipzig); Entity resolution identifies semantically equivalent entities, e.g. describing the same product or customer. It is especially challenging for Big Data applications where large volumes of data from many sources have to be matched and integrated. We therefore introduce a scalable entity resolution framework called FAMER (FAst Multi-source Entity Resolution system) that is based on Apache Flink for distributed execution and that can holistically match entities from multiple sources. For the latter purpose, FAMER includes multiple clustering schemes that group matching entities from different sources within clusters. In addition to previously known clustering schemes FAMER includes new approaches tailored to multi-source entity resolution. We perform a detailed comparative evaluation of eight clustering schemes for different real-life and synthetically generated datasets. The evaluation considers both the match quality as well as the scalability for different numbers of machines and data sizes.; https://csimq-journals.rtu.lv/article/view/2407; Clustering; Matching; Distributed processing; Multi-source |
collection | DOAJ |
language | English |
format | Article |
sources | DOAJ |
author | Alieh Saeedi, Markus Nentwig, Eric Peukert, Erhard Rahm |
title | Scalable Matching and Clustering of Entities with FAMER |
publisher | Riga Technical University |
series | Complex Systems Informatics and Modeling Quarterly |
issn | 2255-9922 |
publishDate | 2018-10-01 |
description | Entity resolution identifies semantically equivalent entities, e.g. describing the same product or customer. It is especially challenging for Big Data applications where large volumes of data from many sources have to be matched and integrated. We therefore introduce a scalable entity resolution framework called FAMER (FAst Multi-source Entity Resolution system) that is based on Apache Flink for distributed execution and that can holistically match entities from multiple sources. For the latter purpose, FAMER includes multiple clustering schemes that group matching entities from different sources within clusters. In addition to previously known clustering schemes FAMER includes new approaches tailored to multi-source entity resolution. We perform a detailed comparative evaluation of eight clustering schemes for different real-life and synthetically generated datasets. The evaluation considers both the match quality as well as the scalability for different numbers of machines and data sizes. |
topic | Clustering, Matching, Distributed processing, Multi-source |
url | https://csimq-journals.rtu.lv/article/view/2407 |