Hash-based Eventual Consistency to Scale the HDFS Block Report

The architecture of the distributed hierarchical file system HDFS imposes limitations on its scalability. All metadata is stored in-memory on a single machine, and in practice, this limits the cluster size to about 4000 servers. Larger HDFS clusters must resort to namespace federation which divides...

Full description

Bibliographic Details
Main Author:	Bonds, August
Format:	Others
Language:	English
Published:	KTH, Skolan för informations- och kommunikationsteknik (ICT) 2017
Subjects:	Computer and Information Sciences Data- och informationsvetenskap
Online Access:	http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-222363

id	ndltd-UPSALLA1-oai-DiVA.org-kth-222363
record_format	oai_dc
spelling	ndltd-UPSALLA1-oai-DiVA.org-kth-2223632018-02-08T05:18:36ZHash-based Eventual Consistency to Scale the HDFS Block ReportengBonds, AugustKTH, Skolan för informations- och kommunikationsteknik (ICT)2017Computer and Information SciencesData- och informationsvetenskapThe architecture of the distributed hierarchical file system HDFS imposes limitations on its scalability. All metadata is stored in-memory on a single machine, and in practice, this limits the cluster size to about 4000 servers. Larger HDFS clusters must resort to namespace federation which divides the filesystem into isolated volumes and changes the semantics of cross-volume filesystem operations (for example, file move becomes a non-atomic combination of copy and delete). Ideally, organizations want to consolidate their data in as few clusters and namespaces as possible to avoid such issues and increase operating efficiency, utility, and maintenance. HopsFS, a new distribution of HDFS developed at KTH, uses an in-memory distributed database for storing metadata. It scales to 10k nodes and has shown that in principle it can support clusters of at least 15 times the size of traditional non-federated HDFS clusters. However, an eventually consistent data loss protection mechanism in HDFS, called the Block Report protocol, prevents HopsFS from reaching its full potential. This thesis provides a solution to scaling the Block Report protocol for HopsFS that uses an incremental, hash-based eventual consistency mechanism to avoid duplicated work. In the average case, our simulations indicate that the solution can reduce the load on the database by an order of magnitude at the cost of less than 10 percent overhead on file mutations while performing similarly to the old solution in the worst case. Det distribuerade, hierarkiska filsystemet Apache HDFS arkitektur begränsar dess skalbarhet. All metadata lagras i minnet i ett av klustrets noder, och i praktiken begränsar detta ett HDFS-klusters storlek till ungefär 4000 noder. Större kluster tvingas partitionera filsystemet i isolerade delar, vilket förändrar beteendet vid operationer som korsar partitionens gränser (exempelvis fil-flytter blir ickeatomära kombinationer av kopiera och radera). I idealfallet kan organisationer sammanslå alla sina lagringslösningar i ett och samma filträd för att undvika sådana beteendeförändringar och därför minska administrationen, samt öka användningen av den hårdvara de väljer att behålla. HopsFS är en ny utgåva av Apache HDFS, utvecklad på KTH, som använder en minnesbaserad distribuerad databaslösning för att lagra metadata. Lösningen kan hantera en klusterstorlek på 10000 noder och har visat att det i princip kan stöda klusterstorlekar på upp till femton gånger Apache HDFS. Ett av de hinder som kvarstår för att HopsFS ska kunna nå dessa nivåer är en så-småningom-konsekvent algoritm för dataförlustskydd i Apache HDFS som kallas Block Report. Detta arbete föreslår en lösning för att öka skalbarheten i HDFS Block Report som använder sig av en hash-baserad så-småningom-konsekvent mekanism för att undvika dubbelt arbete. Simuleringar indikerar att den nya lösningen i genomsnitt kan minska trycket på databasen med en hel storleksordning, till en prestandakostnad om mindre än tio procent på filsystemets vanliga operationer, medan databasanvändningen i värsta-fallet är jämförbart med den gamla lösningen. Student thesisinfo:eu-repo/semantics/bachelorThesistexthttp://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-222363TRITA-ICT-EX ; 2017:150application/pdfinfo:eu-repo/semantics/openAccess
collection	NDLTD
language	English
format	Others
sources	NDLTD
topic	Computer and Information Sciences Data- och informationsvetenskap
spellingShingle	Computer and Information Sciences Data- och informationsvetenskap Bonds, August Hash-based Eventual Consistency to Scale the HDFS Block Report
description	The architecture of the distributed hierarchical file system HDFS imposes limitations on its scalability. All metadata is stored in-memory on a single machine, and in practice, this limits the cluster size to about 4000 servers. Larger HDFS clusters must resort to namespace federation which divides the filesystem into isolated volumes and changes the semantics of cross-volume filesystem operations (for example, file move becomes a non-atomic combination of copy and delete). Ideally, organizations want to consolidate their data in as few clusters and namespaces as possible to avoid such issues and increase operating efficiency, utility, and maintenance. HopsFS, a new distribution of HDFS developed at KTH, uses an in-memory distributed database for storing metadata. It scales to 10k nodes and has shown that in principle it can support clusters of at least 15 times the size of traditional non-federated HDFS clusters. However, an eventually consistent data loss protection mechanism in HDFS, called the Block Report protocol, prevents HopsFS from reaching its full potential. This thesis provides a solution to scaling the Block Report protocol for HopsFS that uses an incremental, hash-based eventual consistency mechanism to avoid duplicated work. In the average case, our simulations indicate that the solution can reduce the load on the database by an order of magnitude at the cost of less than 10 percent overhead on file mutations while performing similarly to the old solution in the worst case. === Det distribuerade, hierarkiska filsystemet Apache HDFS arkitektur begränsar dess skalbarhet. All metadata lagras i minnet i ett av klustrets noder, och i praktiken begränsar detta ett HDFS-klusters storlek till ungefär 4000 noder. Större kluster tvingas partitionera filsystemet i isolerade delar, vilket förändrar beteendet vid operationer som korsar partitionens gränser (exempelvis fil-flytter blir ickeatomära kombinationer av kopiera och radera). I idealfallet kan organisationer sammanslå alla sina lagringslösningar i ett och samma filträd för att undvika sådana beteendeförändringar och därför minska administrationen, samt öka användningen av den hårdvara de väljer att behålla. HopsFS är en ny utgåva av Apache HDFS, utvecklad på KTH, som använder en minnesbaserad distribuerad databaslösning för att lagra metadata. Lösningen kan hantera en klusterstorlek på 10000 noder och har visat att det i princip kan stöda klusterstorlekar på upp till femton gånger Apache HDFS. Ett av de hinder som kvarstår för att HopsFS ska kunna nå dessa nivåer är en så-småningom-konsekvent algoritm för dataförlustskydd i Apache HDFS som kallas Block Report. Detta arbete föreslår en lösning för att öka skalbarheten i HDFS Block Report som använder sig av en hash-baserad så-småningom-konsekvent mekanism för att undvika dubbelt arbete. Simuleringar indikerar att den nya lösningen i genomsnitt kan minska trycket på databasen med en hel storleksordning, till en prestandakostnad om mindre än tio procent på filsystemets vanliga operationer, medan databasanvändningen i värsta-fallet är jämförbart med den gamla lösningen.
author	Bonds, August
author_facet	Bonds, August
author_sort	Bonds, August
title	Hash-based Eventual Consistency to Scale the HDFS Block Report
title_short	Hash-based Eventual Consistency to Scale the HDFS Block Report
title_full	Hash-based Eventual Consistency to Scale the HDFS Block Report
title_fullStr	Hash-based Eventual Consistency to Scale the HDFS Block Report
title_full_unstemmed	Hash-based Eventual Consistency to Scale the HDFS Block Report
title_sort	hash-based eventual consistency to scale the hdfs block report
publisher	KTH, Skolan för informations- och kommunikationsteknik (ICT)
publishDate	2017
url	http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-222363
work_keys_str_mv	AT bondsaugust hashbasedeventualconsistencytoscalethehdfsblockreport
_version_	1718613891609526272

Hash-based Eventual Consistency to Scale the HDFS Block Report

Similar Items