Efficient algorithms for fast integration on large data sets from multiple sources

Abstract. Background: Recent large-scale deployments of health information technology have created opportunities for the integration of patient medical records with disparate public health, human service, and educational databases to provide comprehensive...


Bibliographic Details
Main Authors: Mi Tian, Rajasekaran Sanguthevar, Aseltine Robert
Format: Article
Language: English
Published: BMC 2012-06-01
Series: BMC Medical Informatics and Decision Making
Online Access: http://www.biomedcentral.com/1472-6947/12/59
id doaj-c6acd040762d45a586d5a094065c7ed3
record_format Article
spelling doaj-c6acd040762d45a586d5a094065c7ed3 2020-11-25T00:37:53Z eng BMC BMC Medical Informatics and Decision Making 1472-6947 2012-06-01 12 1 59 10.1186/1472-6947-12-59 Efficient algorithms for fast integration on large data sets from multiple sources Mi Tian Rajasekaran Sanguthevar Aseltine Robert http://www.biomedcentral.com/1472-6947/12/59
collection DOAJ
language English
format Article
sources DOAJ
author Mi Tian
Rajasekaran Sanguthevar
Aseltine Robert
title Efficient algorithms for fast integration on large data sets from multiple sources
publisher BMC
series BMC Medical Informatics and Decision Making
issn 1472-6947
publishDate 2012-06-01
description Abstract. Background: Recent large-scale deployments of health information technology have created opportunities to integrate patient medical records with disparate public health, human service, and educational databases to provide comprehensive information related to health and development. Data integration techniques, which identify records belonging to the same individual that reside in multiple data sets, are essential to these efforts. Several algorithms proposed in the literature are adept at integrating records from two different datasets. Our algorithms are aimed at integrating multiple (in particular, more than two) datasets efficiently. Methods: Hierarchical clustering-based solutions are used to integrate multiple (in particular, more than two) datasets. Edit distance is used as the basic distance measure, and distance calculation for common input errors is also studied. Several techniques have been applied to improve the algorithms in terms of both time and space: 1) Partial Construction of the Dendrogram (PCD), which ignores the levels above the threshold; 2) Ignoring the Dendrogram Structure (IDS); 3) Faster Computation of the Edit Distance (FCED), which predicts the distance against the threshold using upper bounds on edit distance; and 4) a pre-processing blocking phase that limits dynamic computation to within each block. Results: We have experimentally validated our algorithms on large simulated data as well as real data. Accuracy and completeness are defined stringently to show the performance of our algorithms. In addition, we employ a four-category analysis. Comparison with FEBRL shows the robustness of our approach. Conclusions: In the experiments we conducted, the observed accuracy exceeded 90% for the simulated data in most cases. On a real dataset of 1,083,878 records, accuracies of 97.7% and 98.1% were achieved for the constant and proportional thresholds, respectively.
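The abstract names its techniques without giving implementation details, so the following Python sketch is only an illustration of the general idea, not the authors' algorithm. It assumes single-linkage merging cut at a fixed threshold (done with union-find, so no dendrogram is stored), a simple early-exit cutoff in the edit distance computation rather than the paper's FCED upper bounds, and a caller-supplied blocking key. The names `bounded_edit_distance`, `link_records`, and `block_key`, and the sample records in the usage note, are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def bounded_edit_distance(a, b, k):
    """Return the Levenshtein distance if it is <= k, otherwise k + 1."""
    if abs(len(a) - len(b)) > k:
        return k + 1  # the length gap alone already exceeds the threshold
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        if min(curr) > k:
            return k + 1  # no alignment through this row can end within k
        prev = curr
    return min(prev[-1], k + 1)


def link_records(records, k, block_key):
    """Group records linked by chains of pairs within edit distance k.

    Only records that share a blocking key are compared (assumed blocking
    scheme), and merging is single linkage cut at the threshold k.
    """
    parent = list(range(len(records)))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Blocking phase: bucket record indices by their blocking key.
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[block_key(rec)].append(idx)

    # Compare pairs inside each block and merge those within the threshold.
    for members in blocks.values():
        for i, j in combinations(members, 2):
            ri, rj = find(i), find(j)
            if ri != rj and bounded_edit_distance(records[i], records[j], k) <= k:
                parent[rj] = ri

    clusters = defaultdict(list)
    for idx, rec in enumerate(records):
        clusters[find(idx)].append(rec)
    return sorted(sorted(c) for c in clusters.values())
```

For example, `link_records(names, 2, lambda r: r[0])` would compare only names that start with the same letter and merge those within two edits of each other. Blocking on the first character is a deliberately crude choice for illustration; a typo in that position would place matching records in different blocks.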
url http://www.biomedcentral.com/1472-6947/12/59