Research and optimization of the Bloom filter algorithm in Hadoop

Bibliographic Details
Main Author: Dong, Bing
Format: Others
Language: English
Published: Uppsala universitet, Institutionen för informationsteknologi 2013
Online Access: http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-196637
id ndltd-UPSALLA1-oai-DiVA.org-uu-196637
record_format oai_dc
spelling ndltd-UPSALLA1-oai-DiVA.org-uu-196637 2013-03-12T16:11:20Z | Research and optimization of the Bloom filter algorithm in Hadoop | eng | Dong, Bing | Uppsala universitet, Institutionen för informationsteknologi | 2013 | [abstract as given in the description field below] | Student thesis | info:eu-repo/semantics/bachelorThesis | text | http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-196637 | IT ; 13 020 | application/pdf | info:eu-repo/semantics/openAccess
collection NDLTD
language English
format Others
sources NDLTD
description An increasing number of enterprises need to transfer data from a traditional database to a cloud-computing system. Big data in Teradata (a data warehouse) often needs to be transferred to Hadoop, a distributed system, for further computing and analysis. However, if the data stored in Teradata is not kept in sync with Hadoop, e.g. because of data loss during the communication, sync and copy process, the two copies become inconsistent. A survey shows that, besides the algorithm provided by Hadoop, the Bloom filter algorithm is a good choice for data reconciliation. MD5 hashing is applied to reduce the amount of data transmitted. In the experiments, data from both sides was compared using a Bloom filter; if any data was lost during the process, the differing primary keys could be found, and the result can be used to track changes to the original data. During this thesis project, an experimental system based on the MapReduce framework of Hadoop was implemented. The implementation used real data, and its parameters were adjustable so that different schemes (Basic join, CBF, SBF and IBF) could be analyzed. The thesis first introduces the basic knowledge and key technology of the Bloom filter algorithm. It then systematically describes the existing Bloom filter algorithms and the pros and cons of each, and introduces the principle of the MapReduce programming model in Hadoop. Next, three schemes that meet the requirements are introduced in detail. The fourth part presents the implementation of the schemes in Hadoop as well as the design and implementation of the testing system. The fifth part covers the testing and analysis of each scheme; the feasibility of the schemes is analyzed with respect to performance and cost using experimental data. Finally, conclusions and ideas for further improvement of the Bloom filter are presented.
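As a rough, illustrative sketch of the reconciliation idea summarized in the abstract (not the thesis's actual implementation), the following Java snippet builds a Bloom filter over MD5-hashed primary keys from one side of the transfer and probes it with the keys from the other side; the class, method and parameter names are assumptions made for illustration only.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

// Minimal Bloom filter over MD5-hashed primary keys (illustrative sketch only).
public class KeyBloomFilter {
    private final BitSet bits;
    private final int size;       // number of bits m
    private final int numHashes;  // number of hash functions k (k <= 4 in this sketch)

    public KeyBloomFilter(int size, int numHashes) {
        this.size = size;
        this.numHashes = numHashes;
        this.bits = new BitSet(size);
    }

    // Derive k bit positions from the 128-bit MD5 digest of the key.
    private int[] positions(String key) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(key.getBytes(StandardCharsets.UTF_8));
            int[] pos = new int[numHashes];
            for (int i = 0; i < numHashes; i++) {
                // Take 4 bytes of the digest per hash function.
                int h = ((digest[4 * i] & 0xff) << 24)
                        | ((digest[4 * i + 1] & 0xff) << 16)
                        | ((digest[4 * i + 2] & 0xff) << 8)
                        | (digest[4 * i + 3] & 0xff);
                pos[i] = Math.floorMod(h, size);
            }
            return pos;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public void add(String key) {
        for (int p : positions(key)) {
            bits.set(p);
        }
    }

    // False positives are possible; false negatives are not.
    public boolean mightContain(String key) {
        for (int p : positions(key)) {
            if (!bits.get(p)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Primary keys from the copy in Hadoop/HDFS go into the filter.
        List<String> targetKeys = Arrays.asList("1001", "1003");
        // Primary keys from the source side (e.g. a Teradata export) are probed against it.
        List<String> sourceKeys = Arrays.asList("1001", "1002", "1003");

        KeyBloomFilter filter = new KeyBloomFilter(1 << 16, 4);
        targetKeys.forEach(filter::add);

        // Any source key the filter reports as absent was lost in the transfer.
        sourceKeys.stream()
                .filter(k -> !filter.mightContain(k))
                .forEach(k -> System.out.println("Missing on target side: " + k));
    }
}

Because a Bloom filter produces no false negatives, any source key reported as absent is genuinely missing on the target side; keys reported as present may still be false positives, so the false-positive rate must be controlled through the filter size and the number of hash functions.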
author Dong, Bing
spellingShingle Dong, Bing
Research and optimization of the Bloom filter algorithm in Hadoop
author_facet Dong, Bing
author_sort Dong, Bing
title Research and optimization of the Bloom filter algorithm in Hadoop
title_short Research and optimization of the Bloom filter algorithm in Hadoop
title_full Research and optimization of the Bloom filter algorithm in Hadoop
title_fullStr Research and optimization of the Bloom filter algorithm in Hadoop
title_full_unstemmed Research and optimization of the Bloom filter algorithm in Hadoop
title_sort research and optimization of the bloom filter algorithm in hadoop
publisher Uppsala universitet, Institutionen för informationsteknologi
publishDate 2013
url http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-196637
work_keys_str_mv AT dongbing researchandoptimizationofthebloomfilteralgorithminhadoop
_version_ 1716578560771096576