Summary: | Record linkage is the task of finding records that refer to the same entity across different data sources. It is also known as data matching, entity resolution, entity disambiguation, or deduplication. Record linkage is useful in many practices, including data cleaning, data management, and business intelligence. Machine learning methods, both unsupervised and supervised, have been applied to the problem. The rise of the big data era has presented new challenges, and the trade-off between accuracy and scalability raises several critical issues for the linkage process. The objective of this study is to present an overview of state-of-the-art machine learning algorithms for record linkage, to compare them, and to explore how these algorithms can be optimized using different similarity functions. The optimization is evaluated in terms of accuracy and scalability. Results showed that supervised classification algorithms, even with a relatively small training set, classified the data sets in less time than their unsupervised counterparts while achieving approximately the same accuracy.
|