SQL on Hops

In today’s world, data is extremely valuable. Companies and researchers store every sort of data, from user activity to medical records. However, data is useless if one cannot extract meaning and insight from it. In 2004 Dean and Ghemawat introduced the MapReduce framework. This sparked the develo...

Bibliographic Details
Main Author: Buso, Fabio
Format: Others
Language: English
Published: KTH, Skolan för informations- och kommunikationsteknik (ICT) 2017
Subjects: Computer Sciences; Datavetenskap (datalogi)
Online Access: http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-215692
id ndltd-UPSALLA1-oai-DiVA.org-kth-215692
record_format oai_dc
spelling ndltd-UPSALLA1-oai-DiVA.org-kth-215692 | 2018-01-14T05:11:04Z | SQL on Hops | eng | Buso, Fabio | KTH, Skolan för informations- och kommunikationsteknik (ICT) | 2017 | Computer Sciences | Datavetenskap (datalogi) | Student thesis | info:eu-repo/semantics/bachelorThesis | text | http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-215692 | TRITA-ICT-EX ; 2017:146 | application/pdf | info:eu-repo/semantics/openAccess
collection NDLTD
language English
format Others
sources NDLTD
topic Computer Sciences
Datavetenskap (datalogi)
description In today’s world, data is extremely valuable. Companies and researchers store every sort of data, from user activity to medical records. However, data is useless if one cannot extract meaning and insight from it. In 2004 Dean and Ghemawat introduced the MapReduce framework. This sparked the development of open-source frameworks for big data storage (HDFS) and processing (Hadoop). Hops and Apache Hive build on this heritage. The former proposes a new distributed file system that achieves higher scalability and throughput by storing metadata in the MySQL Cluster database. The latter is an open-source data warehousing solution built on top of the Hadoop ecosystem, which allows users to query big data stored on HDFS using a SQL-like query language. Apache Hive is a widely used and mature project; however, it does not guarantee consistency between the data stored on the file system and the metadata describing it, which is stored in a relational database. This means that if users delete Hive’s data from the file system, Hive does not delete the related metadata. This causes two issues: (1) users do not get an error if the data is missing from the file system, and (2) if users forget to delete the metadata, it becomes orphaned in the database. In this thesis we exploit the fact that both HopsFS’ metadata and Hive’s metadata are stored in a relational database to provide a mechanism that automatically deletes Hive’s metadata if the data is deleted from the file system. The second objective of this thesis is to integrate Apache Hive into the Hops ecosystem, and in particular into the HopsWorks platform. HopsWorks is a multi-tenant, UI-based service that allows users to store and process big data projects. In this thesis we develop a custom authenticator for Hive that allows HopsWorks users to authenticate with Hive and integrates with its security model.
author Buso, Fabio
title SQL on Hops
publisher KTH, Skolan för informations- och kommunikationsteknik (ICT)
publishDate 2017
url http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-215692