SQL on Hops
In today’s world, data is extremely valuable. Companies and researchers store every sort of data, from users’ activities to medical records. However, data is useless if one cannot extract meaning and insight from it. In 2004, Dean and Ghemawat introduced the MapReduce framework. This sparked the develo...
Main Author: | Buso, Fabio |
---|---|
Format: | Others |
Language: | English |
Published: | KTH, Skolan för informations- och kommunikationsteknik (ICT), 2017 |
Subjects: | Computer Sciences; Datavetenskap (datalogi) |
Online Access: | http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-215692 |
id |
ndltd-UPSALLA1-oai-DiVA.org-kth-215692 |
---|---|
record_format |
oai_dc |
spelling |
ndltd-UPSALLA1-oai-DiVA.org-kth-215692; 2018-01-14T05:11:04Z; SQL on Hops; eng; Buso, Fabio; KTH, Skolan för informations- och kommunikationsteknik (ICT); 2017; Computer Sciences; Datavetenskap (datalogi); Student thesis; info:eu-repo/semantics/bachelorThesis; text; http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-215692; TRITA-ICT-EX ; 2017:146; application/pdf; info:eu-repo/semantics/openAccess |
collection |
NDLTD |
language |
English |
format |
Others |
sources |
NDLTD |
topic |
Computer Sciences; Datavetenskap (datalogi) |
spellingShingle |
Computer Sciences; Datavetenskap (datalogi); Buso, Fabio; SQL on Hops |
description |
In today’s world, data is extremely valuable. Companies and researchers store every sort of data, from users’ activities to medical records. However, data is useless if one cannot extract meaning and insight from it. In 2004, Dean and Ghemawat introduced the MapReduce framework. This sparked the development of open source frameworks for big data storage (HDFS) and processing (Hadoop). Hops and Apache Hive build on top of this heritage. The former proposes a new distributed file system which achieves higher scalability and throughput by storing metadata in a database called MySQL-Cluster. The latter is an open source data warehousing solution built on top of the Hadoop ecosystem, which allows users to query big data stored on HDFS using a SQL-like query language. Apache Hive is a widely used and mature project; however, it lacks consistency between the data stored on the file system and the metadata describing it, which is stored in a relational database. This means that if users delete Hive’s data from the file system, Hive does not delete the related metadata. This causes two issues: (1) users do not get an error if the data is missing from the file system, and (2) if users forget to delete the metadata, it becomes orphaned in the database. In this thesis we exploit the fact that both HopsFS’ metadata and Hive’s metadata are stored in a relational database to provide a mechanism that automatically deletes Hive’s metadata if the data is deleted from the file system. The second objective of this thesis is to integrate Apache Hive into the Hops ecosystem, and in particular into the HopsWorks platform. HopsWorks is a multitenant, UI-based service which allows users to store and process big data projects. In this thesis we develop a custom authenticator for Hive to allow HopsWorks users to authenticate with Hive and to integrate with its security model. |
author |
Buso, Fabio |
title |
SQL on Hops |
publisher |
KTH, Skolan för informations- och kommunikationsteknik (ICT) |
publishDate |
2017 |
url |
http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-215692 |