Accelerating SPARQL Queries and Analytics on RDF Data

The complexity of SPARQL queries and RDF applications poses great challenges for distributed RDF management systems. SPARQL workloads are dynamic and consist of queries with variable complexities. Hence, systems that use static partitioning suffer from communication overhead for workloads that gener...

Bibliographic Details
Main Author:	Al-Harbi, Razen
Other Authors:	Kalnis, Panos
Language:	en
Published:	2016
Subjects:	RDF; SPARQL; Distributed Databases; Parallel Processing; Query Optimization; Adaptive Partitioning
Online Access:	Al-Harbi, R. (2016). Accelerating SPARQL Queries and Analytics on RDF Data. KAUST Research Repository. https://doi.org/10.25781/KAUST-HH33E http://hdl.handle.net/10754/621815

id	ndltd-kaust.edu.sa-oai-repository.kaust.edu.sa-10754-621815
record_format	oai_dc
spelling	ndltd-kaust.edu.sa-oai-repository.kaust.edu.sa-10754-621815 2021-08-30T05:09:27Z Accelerating SPARQL Queries and Analytics on RDF Data Al-Harbi, Razen Kalnis, Panos Computer, Electrical and Mathematical Science and Engineering (CEMSE) Division Canini, Marco Salama, Khaled N. Vlachos, Michail. RDF SPARQL Distributed Databases Parallel Processing Query Optimization Adaptive Partitioning The complexity of SPARQL queries and RDF applications poses great challenges for distributed RDF management systems. SPARQL workloads are dynamic and consist of queries with variable complexities. Hence, systems that use static partitioning suffer from communication overhead for workloads that generate excessive communication. Concurrently, RDF applications are becoming more sophisticated, mandating analytical operations that extend beyond SPARQL queries. Being primarily designed and optimized to execute SPARQL queries, which lack procedural capabilities, existing systems are not suitable for rich RDF analytics. This dissertation tackles the problem of accelerating SPARQL queries and RDF analytics on distributed shared-nothing RDF systems. First, a distributed RDF engine, coined AdPart, is introduced. AdPart uses lightweight hash partitioning to shard triples by their subject values, which keeps its startup overhead very low. The locality-aware query optimizer of AdPart takes full advantage of the partitioning to (i) support the fully parallel processing of join patterns on subjects and (ii) minimize data communication for general queries by applying hash distribution of intermediate results instead of broadcasting, wherever possible. By exploiting hash-based locality, AdPart achieves better or comparable performance to systems that employ sophisticated partitioning schemes. To cope with workload dynamism, AdPart is extended to dynamically adapt to workload changes. 
AdPart monitors the data access patterns and dynamically redistributes and replicates the instances of the most frequent patterns among workers. Consequently, the communication cost for future queries is drastically reduced or even eliminated. Experiments with synthetic and real data verify that AdPart starts faster than all existing systems and gracefully adapts to the query load. Finally, to support and accelerate rich RDF analytical tasks, a vertex-centric RDF analytics framework is proposed. The framework, named SPARTex, bridges the gap between RDF and graph processing. To do so, SPARTex: (i) implements a generic SPARQL operator as a vertex-centric program. The operator is coupled with an optimizer that generates efficient execution plans. (ii) It allows SPARQL to invoke vertex-centric programs as stored procedures. Finally, (iii) it provides a unified in-memory data store that allows the persistence of intermediate results. Consequently, SPARTex can efficiently support RDF analytical tasks consisting of complex pipelines of operators. 2016-11-10T07:29:20Z 2016-11-10T07:29:20Z 2016-11-09 Dissertation Al-Harbi, R. (2016). Accelerating SPARQL Queries and Analytics on RDF Data. KAUST Research Repository. https://doi.org/10.25781/KAUST-HH33E 10.25781/KAUST-HH33E http://hdl.handle.net/10754/621815 en
collection	NDLTD
language	en
sources	NDLTD
topic	RDF SPARQL Distributed Databases Parallel Processing Query Optimization Adaptive Partitioning
spellingShingle	RDF SPARQL Distributed Databases Parallel Processing Query Optimization Adaptive Partitioning Al-Harbi, Razen Accelerating SPARQL Queries and Analytics on RDF Data
description	The complexity of SPARQL queries and RDF applications poses great challenges for distributed RDF management systems. SPARQL workloads are dynamic and consist of queries with variable complexities. Hence, systems that use static partitioning suffer from communication overhead for workloads that generate excessive communication. Concurrently, RDF applications are becoming more sophisticated, mandating analytical operations that extend beyond SPARQL queries. Being primarily designed and optimized to execute SPARQL queries, which lack procedural capabilities, existing systems are not suitable for rich RDF analytics. This dissertation tackles the problem of accelerating SPARQL queries and RDF analytics on distributed shared-nothing RDF systems. First, a distributed RDF engine, coined AdPart, is introduced. AdPart uses lightweight hash partitioning to shard triples by their subject values, which keeps its startup overhead very low. The locality-aware query optimizer of AdPart takes full advantage of the partitioning to (i) support the fully parallel processing of join patterns on subjects and (ii) minimize data communication for general queries by applying hash distribution of intermediate results instead of broadcasting, wherever possible. By exploiting hash-based locality, AdPart achieves better or comparable performance to systems that employ sophisticated partitioning schemes. To cope with workload dynamism, AdPart is extended to dynamically adapt to workload changes. AdPart monitors the data access patterns and dynamically redistributes and replicates the instances of the most frequent patterns among workers. Consequently, the communication cost for future queries is drastically reduced or even eliminated. Experiments with synthetic and real data verify that AdPart starts faster than all existing systems and gracefully adapts to the query load. 
Finally, to support and accelerate rich RDF analytical tasks, a vertex-centric RDF analytics framework is proposed. The framework, named SPARTex, bridges the gap between RDF and graph processing. To do so, SPARTex: (i) implements a generic SPARQL operator as a vertex-centric program. The operator is coupled with an optimizer that generates efficient execution plans. (ii) It allows SPARQL to invoke vertex-centric programs as stored procedures. Finally, (iii) it provides a unified in-memory data store that allows the persistence of intermediate results. Consequently, SPARTex can efficiently support RDF analytical tasks consisting of complex pipelines of operators.
author2	Kalnis, Panos
author_facet	Kalnis, Panos Al-Harbi, Razen
author	Al-Harbi, Razen
author_sort	Al-Harbi, Razen
title	Accelerating SPARQL Queries and Analytics on RDF Data
title_short	Accelerating SPARQL Queries and Analytics on RDF Data
title_full	Accelerating SPARQL Queries and Analytics on RDF Data
title_fullStr	Accelerating SPARQL Queries and Analytics on RDF Data
title_full_unstemmed	Accelerating SPARQL Queries and Analytics on RDF Data
title_sort	accelerating sparql queries and analytics on rdf data
publishDate	2016
url	Al-Harbi, R. (2016). Accelerating SPARQL Queries and Analytics on RDF Data. KAUST Research Repository. https://doi.org/10.25781/KAUST-HH33E http://hdl.handle.net/10754/621815
work_keys_str_mv	AT alharbirazen acceleratingsparqlqueriesandanalyticsonrdfdata
_version_	1719472710709411840

Similar Items