SQL on Hops

In today’s world, data is extremely valuable. Companies and researchers store every sort of data, from users’ activities to medical records. However, data is useless if one cannot extract meaning and insight from it. In 2004 Dean and Ghemawat introduced the MapReduce framework. This sparked the develo...

Bibliographic Details
Main Author:	Buso, Fabio
Format:	Others
Language:	English
Published:	KTH, Skolan för informations- och kommunikationsteknik (ICT) 2017
Subjects:	Computer Sciences Datavetenskap (datalogi)
Online Access:	http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-215692

id	ndltd-UPSALLA1-oai-DiVA.org-kth-215692
record_format	oai_dc
spelling	ndltd-UPSALLA1-oai-DiVA.org-kth-215692 | 2018-01-14T05:11:04Z | SQL on Hops | eng | Buso, Fabio | KTH, Skolan för informations- och kommunikationsteknik (ICT) | 2017 | Computer Sciences | Datavetenskap (datalogi) | Student thesis | info:eu-repo/semantics/bachelorThesis | text | http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-215692 | TRITA-ICT-EX ; 2017:146 | application/pdf | info:eu-repo/semantics/openAccess
collection	NDLTD
description	In today’s world, data is extremely valuable. Companies and researchers store every sort of data, from users’ activities to medical records. However, data is useless if one cannot extract meaning and insight from it. In 2004 Dean and Ghemawat introduced the MapReduce framework. This sparked the development of open source frameworks for big data storage (HDFS) and processing (Hadoop). Hops and Apache Hive build on top of this heritage. The former proposes a new distributed file system which achieves higher scalability and throughput by storing metadata in a database called MySQL-Cluster. The latter is an open source data warehousing solution built on top of the Hadoop ecosystem, which allows users to query big data stored on HDFS using a SQL-like query language. Apache Hive is a widely used and mature project; however, it lacks consistency between the data stored on the file system and the metadata describing it, which is stored in a relational database. This means that if users delete Hive’s data from the file system, Hive does not delete the related metadata. This causes two issues: (1) users do not get an error if the data is missing from the file system, and (2) if users forget to delete the metadata, it becomes orphaned in the database. In this thesis we exploit the fact that both HopsFS’ metadata and Hive’s metadata are stored in a relational database to provide a mechanism that automatically deletes Hive’s metadata if the data is deleted from the file system. The second objective of this thesis is to integrate Apache Hive into the Hops ecosystem, and in particular into the HopsWorks platform. HopsWorks is a multitenant, UI-based service which allows users to store and process big data in projects. In this thesis we develop a custom authenticator for Hive to allow HopsWorks users to authenticate with Hive and to integrate with its security model.

Similar Items