Identifying the Data Discrepancy Existing in Hadoop Clusters

Bibliographic Details
Main Authors: YU TZU-TING, 游資婷
Other Authors: 葉佐任
Format: Others
Language: zh-TW
Published: 2016
Online Access: http://ndltd.ncl.edu.tw/handle/21290145999940618636
Description
Summary: Master's === Fu Jen Catholic University === Master's Program, Department of Computer Science and Information Engineering === 104 === In recent years, cloud computing has been developing rapidly in the realm of the Internet. Among the many cloud computing platforms, Hadoop is widely used because of its stability and performance; it can easily handle a large number of files in a very efficient way. Hadoop is a distributed system, and the Hadoop Distributed File System (HDFS) is the default file system on the Hadoop platform. HDFS consists of a NameNode and multiple DataNodes. The NameNode records file metadata, including file location, file owner, and other related information. DataNodes are the actual places where file contents are stored, and each file is generally replicated on several DataNodes. However, file contents still cannot be retrieved if the NameNode is lost, or if all DataNodes storing those files are destroyed at the same time. To address this problem, important files can be backed up on multiple Hadoop clusters. Nevertheless, errors could occur during the process of file duplication. We design and implement a scheme to identify discrepancies between Hadoop clusters so that users can fix mismatches between files duplicated on different Hadoop clusters.
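The abstract does not specify how the discrepancy check is performed, but a common approach is to compare per-file checksums between the primary and backup copies. The sketch below is a minimal, hypothetical illustration of that idea: it walks two local directory trees (standing in for the file sets of two Hadoop clusters), hashes each file, and reports files that are missing from the backup or whose contents differ. The function and path names are illustrative assumptions, not part of the thesis; a real HDFS implementation would read paths through the HDFS client API instead of the local filesystem.

```python
import hashlib
import os

def file_checksum(path, algo="md5", chunk_size=1 << 20):
    """Hash a file in fixed-size chunks so large files do not exhaust memory."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        while block := f.read(chunk_size):
            h.update(block)
    return h.hexdigest()

def snapshot(root):
    """Map each file's path (relative to `root`) to its checksum."""
    result = {}
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root)
            result[rel] = file_checksum(full)
    return result

def find_discrepancies(primary_root, backup_root):
    """Return (missing, mismatched): files absent from the backup,
    and files present in both trees whose contents differ."""
    primary = snapshot(primary_root)
    backup = snapshot(backup_root)
    missing = sorted(set(primary) - set(backup))
    mismatched = sorted(p for p in primary
                        if p in backup and primary[p] != backup[p])
    return missing, mismatched
```

Chunked hashing keeps memory use constant regardless of file size, which matters for the large files Hadoop typically stores; the relative-path keying lets the same logical file be compared even though the two clusters mount it under different roots.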