Research and optimization of the Bloom filter algorithm in Hadoop
An increasing number of enterprises need to transfer data from a traditional database to a cloud-computing system. Big data in Teradata (a data warehouse) often needs to be transferred to Hadoop, a distributed system, for further computing and analysis. However, if the data stored in Teradata is not kept in sync with Hadoop, for example because of data loss during communication, synchronization, or copying, the two copies of the data diverge. A survey shows that, besides the algorithm provided by Hadoop, the Bloom filter algorithm is a good choice for data reconciliation. MD5 hashing is applied to reduce the amount of data transmitted. In the experiments, data from both sides was compared using a Bloom filter; if any data was lost during the process, the differing primary keys could be found, and the result can be used to track changes to the original data. During this thesis project, an experimental system based on Hadoop's MapReduce model was implemented. The implementation used real data, and the parameters were adjustable in order to analyze the different schemes (Basic join, CBF, SBF and IBF). The thesis first introduces the basic concepts and key techniques of the Bloom filter algorithm, then systematically reviews the existing Bloom filter variants and the pros and cons of each, and also introduces the principles of MapReduce programming in Hadoop. The next part presents in detail three schemes that all meet the requirements. The fourth part covers the implementation of the schemes in Hadoop as well as the design and implementation of the testing system. In the fifth part, each scheme is tested and analyzed, and the feasibility of the schemes is evaluated with respect to performance and cost using experimental data. Finally, conclusions and ideas for further improvement of the Bloom filter are presented.
Main Author: | Dong, Bing |
---|---|
Format: | Others |
Language: | English |
Published: | Uppsala universitet, Institutionen för informationsteknologi, 2013 |
Series: | IT ; 13 020 |
Online Access: | http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-196637 |
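The reconciliation approach described in the abstract builds a Bloom filter over MD5 digests of the primary keys on one side and probes it with the keys from the other side; keys that fail the membership test point to lost or divergent records. Below is a minimal sketch of such a filter in Java. It is only an illustration under assumed parameters (bit-array size, number of hash functions, key format), not the thesis's actual implementation of the Basic join, CBF, SBF or IBF schemes.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.BitSet;

// Illustrative Bloom filter keyed on MD5 digests of primary keys.
// m = number of bits in the filter, k = number of hash functions derived from the digest.
public class PrimaryKeyBloomFilter {
    private final BitSet bits;
    private final int m;
    private final int k;

    public PrimaryKeyBloomFilter(int m, int k) {
        this.bits = new BitSet(m);
        this.m = m;
        this.k = k;
    }

    // Derive k bit positions from the 128-bit MD5 digest of the key.
    private int[] positions(String key) {
        try {
            byte[] d = MessageDigest.getInstance("MD5")
                    .digest(key.getBytes(StandardCharsets.UTF_8));
            int[] pos = new int[k];
            for (int i = 0; i < k; i++) {
                // Take 4 bytes per hash function, wrapping around the digest.
                int h = 0;
                for (int j = 0; j < 4; j++) {
                    h = (h << 8) | (d[(i * 4 + j) % d.length] & 0xff);
                }
                pos[i] = Math.floorMod(h, m);
            }
            return pos;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public void add(String key) {
        for (int p : positions(key)) bits.set(p);
    }

    // False positives are possible; false negatives are not.
    public boolean mightContain(String key) {
        for (int p : positions(key)) if (!bits.get(p)) return false;
        return true;
    }

    public static void main(String[] args) {
        // Hypothetical sizes chosen only for the example.
        PrimaryKeyBloomFilter filter = new PrimaryKeyBloomFilter(1 << 20, 5);
        // Build the filter from the primary keys on one side (e.g. Teradata) ...
        filter.add("order-1001");
        filter.add("order-1002");
        // ... then probe it with the keys from the other side (e.g. Hadoop).
        System.out.println(filter.mightContain("order-1001")); // true
        System.out.println(filter.mightContain("order-9999")); // false with high probability
    }
}
```

Keys that the filter reports as absent are definitely missing from the side the filter was built on; keys it reports as present may still be false positives, which is why the thesis compares the schemes with respect to performance and cost.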