Dynamic De-duplication Decision in a Hadoop Distributed File System

Bibliographic Details
Main Author: Kuo-Zheng Fan (范國拯)
Other Authors: Ruay-Shiung Chang
Format: Others
Published: 2013
Online Access: http://ndltd.ncl.edu.tw/handle/12180320597103126420
Description
Summary: Master's thesis === National Dong Hwa University === Department of Computer Science and Information Engineering === Academic Year 101 (2012) === Nowadays, data is generated and updated every second, and coping with such fast-growing and multiform volumes of data is a heavy challenge. The Hadoop Distributed File System (HDFS) is the first-choice solution for many people. However, data is usually protected against loss by keeping many replicas, and HDFS does this as well. These duplicates occupy a great deal of storage space, which means sufficient funding must be invested in infrastructure. This is not a viable approach for everybody, since it may be unaffordable. De-duplication technology can therefore use storage space more effectively; it has been gaining increasing attention in research and commercial products, and we apply it in our implementation. In this paper, we propose a dynamic de-duplication decision that improves storage utilization on HDFS. Under a storage space limitation, the system formulates a proper de-duplication strategy according to the capability of the cluster and the utilization of its storage space. By doing so, the usage of storage systems can be improved.
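
The record stops at the abstract, so the decision logic itself is not shown here. The following is a minimal, hypothetical Java sketch of the general idea the abstract describes (all class, method, and parameter names are illustrative, not taken from the thesis): data blocks are fingerprinted by content hash, and whether a duplicate block is physically stored as an extra replica or replaced by a reference to an existing copy depends on the cluster's current storage utilization.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

/**
 * Sketch of a dynamic de-duplication decision (hypothetical names).
 * Blocks are fingerprinted with SHA-256; a duplicate block is kept as
 * an extra replica only while storage utilization is below a threshold.
 */
public class DedupDecision {

    /** In-memory fingerprint index: block hash -> reference count. */
    private final Map<String, Integer> index = new HashMap<>();

    /** Utilization level above which de-duplication is enforced. */
    private final double threshold;

    public DedupDecision(double threshold) {
        this.threshold = threshold;
    }

    /** SHA-256 fingerprint of a data block, hex-encoded. */
    static String fingerprint(byte[] block) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(block)) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /**
     * Decide whether a block must be physically written.
     * @param block       block content
     * @param utilization current storage utilization in [0, 1]
     * @return true to write the block; false to store only a reference
     */
    public boolean shouldStore(byte[] block, double utilization) {
        String fp = fingerprint(block);
        boolean seen = index.containsKey(fp);
        index.merge(fp, 1, Integer::sum);
        if (!seen) {
            return true;              // first copy is always written
        }
        // Duplicate: keep an extra replica while space is plentiful;
        // above the threshold, de-duplicate to a reference instead.
        return utilization < threshold;
    }

    public static void main(String[] args) {
        DedupDecision d = new DedupDecision(0.8);
        byte[] b = "hello hdfs".getBytes(StandardCharsets.UTF_8);
        System.out.println(d.shouldStore(b, 0.5)); // true: first copy
        System.out.println(d.shouldStore(b, 0.5)); // true: keep replica
        System.out.println(d.shouldStore(b, 0.9)); // false: de-duplicate
    }
}

In this sketch the threshold stands in for the "memory space limitation" of the abstract; the thesis additionally weighs the ability of the cluster when formulating the strategy, which a real implementation would fold into the decision alongside utilization.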