Research and optimization of the Bloom filter algorithm in Hadoop

Bibliographic Details
Main Author: Dong, Bing
Format: Others
Language: English
Published: Uppsala universitet, Institutionen för informationsteknologi 2013
Online Access: http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-196637
id ndltd-UPSALLA1-oai-DiVA.org-uu-196637
record_format oai_dc
spelling ndltd-UPSALLA1-oai-DiVA.org-uu-196637 2013-03-12T16:11:20Z | Research and optimization of the Bloom filter algorithm in Hadoop | eng | Dong, Bing | Uppsala universitet, Institutionen för informationsteknologi | 2013 | [abstract as given in the description field below] | Student thesis | info:eu-repo/semantics/bachelorThesis | text | http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-196637 | IT ; 13 020 | application/pdf | info:eu-repo/semantics/openAccess
collection NDLTD
language English
format Others
sources NDLTD
description An increasing number of enterprises need to transfer data from a traditional database to a cloud-computing system. Big data in Teradata (a data warehouse) often needs to be transferred to Hadoop, a distributed system, for further computing and analysis. However, if the data stored in Teradata is not kept in sync with Hadoop, e.g. because of data loss during the communication, sync and copy process, the two copies become inconsistent. A survey shows that, besides the algorithm provided by Hadoop, the Bloom filter algorithm is a good choice for data reconciliation. MD5 hashing is applied to reduce the amount of data transmitted. In the experiments, data from both sides was compared using a Bloom filter; if any data was lost during the process, the differing primary keys could be found, and the result can be used to track changes to the original data. During this thesis project, an experimental system based on the MapReduce framework of Hadoop was implemented. The implementation used real data, and its parameters were adjustable so that different schemes (Basic join, CBF, SBF and IBF) could be analyzed. The thesis first introduces the basic knowledge and key technology of the Bloom filter algorithm. It then systematically describes the existing Bloom filter algorithms and the pros and cons of each, and introduces the principle of the MapReduce programming model in Hadoop. Next, three schemes that meet the requirements are introduced in detail. The fourth part presents the implementation of the schemes in Hadoop as well as the design and implementation of the testing system. The fifth part covers the testing and analysis of each scheme; the feasibility of the schemes is analyzed with respect to performance and cost using experimental data. Finally, conclusions and ideas for further improvement of the Bloom filter are presented.
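As a rough, illustrative sketch of the reconciliation idea summarized in the abstract (not the thesis's actual implementation), the following Java snippet builds a Bloom filter over MD5-hashed primary keys from one side of the transfer and probes it with the keys from the other side; the class, method and parameter names are assumptions made for illustration only.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

// Minimal Bloom filter over MD5-hashed primary keys (illustrative sketch only).
public class KeyBloomFilter {
    private final BitSet bits;
    private final int size;       // number of bits m
    private final int numHashes;  // number of hash functions k (k <= 4 in this sketch)

    public KeyBloomFilter(int size, int numHashes) {
        this.size = size;
        this.numHashes = numHashes;
        this.bits = new BitSet(size);
    }

    // Derive k bit positions from the 128-bit MD5 digest of the key.
    private int[] positions(String key) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(key.getBytes(StandardCharsets.UTF_8));
            int[] pos = new int[numHashes];
            for (int i = 0; i < numHashes; i++) {
                // Take 4 bytes of the digest per hash function.
                int h = ((digest[4 * i] & 0xff) << 24)
                        | ((digest[4 * i + 1] & 0xff) << 16)
                        | ((digest[4 * i + 2] & 0xff) << 8)
                        | (digest[4 * i + 3] & 0xff);
                pos[i] = Math.floorMod(h, size);
            }
            return pos;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public void add(String key) {
        for (int p : positions(key)) {
            bits.set(p);
        }
    }

    // False positives are possible; false negatives are not.
    public boolean mightContain(String key) {
        for (int p : positions(key)) {
            if (!bits.get(p)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Primary keys from the copy in Hadoop/HDFS go into the filter.
        List<String> targetKeys = Arrays.asList("1001", "1003");
        // Primary keys from the source side (e.g. a Teradata export) are probed against it.
        List<String> sourceKeys = Arrays.asList("1001", "1002", "1003");

        KeyBloomFilter filter = new KeyBloomFilter(1 << 16, 4);
        targetKeys.forEach(filter::add);

        // Any source key the filter reports as absent was lost in the transfer.
        sourceKeys.stream()
                .filter(k -> !filter.mightContain(k))
                .forEach(k -> System.out.println("Missing on target side: " + k));
    }
}

Because a Bloom filter produces no false negatives, any source key reported as absent is genuinely missing on the target side; keys reported as present may still be false positives, so the false-positive rate must be controlled through the filter size and the number of hash functions.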
author Dong, Bing
spellingShingle Dong, Bing
Research and optimization of the Bloom filter algorithm in Hadoop
author_facet Dong, Bing
author_sort Dong, Bing
title Research and optimization of the Bloom filter algorithm in Hadoop
title_short Research and optimization of the Bloom filter algorithm in Hadoop
title_full Research and optimization of the Bloom filter algorithm in Hadoop
title_fullStr Research and optimization of the Bloom filter algorithm in Hadoop
title_full_unstemmed Research and optimization of the Bloom filter algorithm in Hadoop
title_sort research and optimization of the bloom filter algorithm in hadoop
publisher Uppsala universitet, Institutionen för informationsteknologi
publishDate 2013
url http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-196637
work_keys_str_mv AT dongbing researchandoptimizationofthebloomfilteralgorithminhadoop
_version_ 1716578560771096576