Real-time Business Intelligence through Compact and Efficient Query Processing Under Updates

Responsive analytics are rapidly taking over the traditional data analytics dominated by the post-fact approaches in traditional data warehousing. Recent advancements in analytics demand placing analytical engines at the forefront of the system to react to updates occurring at high speed and detect...

Bibliographic Details
Main Author: Idris, Muhammad
Other Authors: Vansummeren, Stijn
Format: Doctoral Thesis
Language: en
Published: Université libre de Bruxelles 2019
Subjects:
Online Access: https://dipot.ulb.ac.be/dspace/bitstream/2013/284705/3/TableOfContents.pdf
https://dipot.ulb.ac.be/dspace/bitstream/2013/284705/5/contratMI.pdf
https://dipot.ulb.ac.be/dspace/bitstream/2013/284705/4/PhD-Thesis_Muhammad_Idris.pdf
http://hdl.handle.net/2013/ULB-DIPOT:oai:dipot.ulb.ac.be:2013/284705
id ndltd-ulb.ac.be-oai-dipot.ulb.ac.be-2013-284705
record_format oai_dc
collection NDLTD
language en
format Doctoral Thesis
sources NDLTD
topic General computer science
Information and communication technologies (ICT)
Business Intelligence
Databases
Data Warehouse
Query Processing
Query Execution
Real-time Analytics
Stream Processing
Complex Event Processing
Information Flow Processing
Joins
Join Trees
Main-Memory System
Inequality Joins
Theta Joins
Analytical Processing
Query Language
Acyclic Joins
Join Algorithms
Acyclicity
description Responsive analytics are rapidly replacing traditional data analytics, which is dominated by the after-the-fact approaches of traditional data warehousing. Recent advances in analytics demand that analytical engines be placed at the forefront of the system so that they can react to high-speed updates and detect patterns, trends, and anomalies. Such solutions find applications in financial systems, industrial control systems, business intelligence, and online machine learning, among others. These applications are usually associated with Big Data and require the ability to react to constantly changing data in order to obtain timely insights and take proactive measures. Generally, these systems specify the analytical results or their basic elements in a query language, and the main task is then to maintain query results efficiently under frequent updates. The task of reacting to updates and analyzing changing data has been addressed in two ways in the literature: traditional business intelligence (BI) solutions focus on historical data analysis, where the data is refreshed periodically and in batches, whereas stream processing solutions process streams of data from transient sources as flows of data items. Both kinds of systems share the niche of reacting to updates (known as dynamic evaluation); however, they differ in architecture, query languages, and processing mechanisms. In this thesis, we investigate the possibility of a reactive and unified framework to model queries that appear in both kinds of systems.
In traditional BI solutions, evaluating queries under updates has been studied under the umbrella of incremental query evaluation, which is based on the relational incremental view maintenance model and mostly focuses on queries that feature equi-joins. Streaming systems, in contrast, generally follow automaton-based models to evaluate queries under updates, and they mostly process queries that feature comparisons of temporal attributes (e.g. timestamp attributes) along with comparisons of non-temporal attributes over streams of bounded size. Temporal comparisons constitute inequality constraints, while non-temporal comparisons can be either equality or inequality constraints. Hence these systems mostly process inequality joins. As a starting point for our research, we postulate the thesis that queries in streaming systems can also be evaluated efficiently in a main-memory model based on the paradigm of incremental evaluation, just as in BI systems. The efficiency of such a model is measured in terms of runtime memory footprint and update processing cost. The existing approaches to dynamic evaluation in both kinds of systems present a trade-off between memory footprint and update processing cost. More specifically, systems that avoid materialization of query (sub)results incur high update latency, and systems that materialize (sub)results incur a high memory footprint. We are interested in investigating the possibility of building a model that can address this trade-off. In particular, we address this trade-off by investigating a practical dynamic evaluation algorithm for queries that appear in both kinds of systems, and we present a main-memory data representation that allows query (sub)results to be enumerated without materialization and that can be maintained efficiently under updates. We call these representations Dynamic Constant Delay Linear Representations (DCLRs).
We devise DCLRs with the following properties: 1) they allow, without materialization, enumeration of query results with bounded delay (and with constant delay for a sub-class of queries), 2) they allow tuple lookup in query results with logarithmic delay (and with constant delay for conjunctive queries with equi-joins only), 3) they take space linear in the size of the database, 4) they can be maintained efficiently under updates.
We first study DCLRs with the above-described properties for the class of acyclic conjunctive queries featuring equi-joins with projections, and present a dynamic evaluation algorithm called the Dynamic Yannakakis (DYN) algorithm. Then, we present the generalization of the DYN algorithm to the class of acyclic queries featuring multi-way Theta-joins with projections and call it Generalized DYN (GDYN). We devise DCLRs with the above properties for acyclic conjunctive queries; the operation of DYN and GDYN over DCLRs is based on a particular variant of join trees, called Generalized Join Trees (GJTs), that guarantees the above-described properties of DCLRs. We define GJTs and present algorithms to test a conjunctive query featuring Theta-joins for acyclicity and to generate GJTs for such queries. We extend the classical GYO algorithm from testing a conjunctive query with equalities for acyclicity to testing a conjunctive query featuring multi-way Theta-joins with projections for acyclicity. We further extend the GYO algorithm to generate GJTs for queries that are acyclic.
GDYN is hence a unified framework based on DCLRs that enables processing of queries that appear in streaming systems as well as in BI systems in a unified main-memory model and addresses the space-time trade-off. We instantiate GDYN to the particular case where all Theta-joins involve only equalities and inequalities and call this instantiation IEDYN. We implement DYN and IEDYN as query compilers that generate executable programs in the Scala programming language and provide all the necessary data structures and their maintenance and enumeration methods in a continuous stream processing model. We evaluate DYN and IEDYN against state-of-the-art BI and streaming systems on both industrial and synthetically generated benchmarks. We show that DYN and IEDYN outperform the existing systems by over an order of magnitude in both memory footprint and update processing time.
=== Doctorate in Engineering Sciences and Technology === info:eu-repo/semantics/nonPublished
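The description refers to the classical GYO algorithm for testing whether a conjunctive query with equalities is acyclic. As a rough illustration of that classical test only (not the thesis's extension to Theta-joins or its GJT construction), the following Python sketch treats each query atom as a set of variables and applies the two standard GYO reduction rules. The function name `is_acyclic` and the example queries are assumptions made for this sketch.

```python
from collections import Counter


def is_acyclic(atoms):
    """Classical GYO reduction on the hypergraph of an equi-join query.

    `atoms` is an iterable of iterables of variable names, one per atom.
    Returns True if the query hypergraph is (alpha-)acyclic.
    """
    edges = [set(a) for a in atoms]
    changed = True
    while changed:
        changed = False
        # Rule 1: remove variables that occur in exactly one hyperedge.
        counts = Counter(v for e in edges for v in e)
        for e in edges:
            isolated = {v for v in e if counts[v] == 1}
            if isolated:
                e -= isolated  # in-place update of the set held in `edges`
                changed = True
        # Rule 2: remove one hyperedge that is contained in another one.
        for i, e in enumerate(edges):
            if any(i != j and e <= f for j, f in enumerate(edges)):
                del edges[i]
                changed = True
                break
    # Acyclic iff the reduction leaves nothing but (at most one) empty edge.
    return all(not e for e in edges)


# Hypothetical example queries, written as lists of atoms over variables.
path_query = [("a", "b"), ("b", "c"), ("c", "d")]      # R(a,b), S(b,c), T(c,d)
triangle_query = [("a", "b"), ("b", "c"), ("c", "a")]  # R(a,b), S(b,c), T(c,a)
```

Under these assumptions, `is_acyclic(path_query)` holds while `is_acyclic(triangle_query)` does not, which matches the usual textbook examples of an acyclic and a cyclic join.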
author2 Vansummeren, Stijn
author_facet Vansummeren, Stijn
Idris, Muhammad
author Idris, Muhammad
author_sort Idris, Muhammad
title Real-time Business Intelligence through Compact and Efficient Query Processing Under Updates
publisher Université libre de Bruxelles
publishDate 2019
spelling ndltd-ulb.ac.be-oai-dipot.ulb.ac.be-2013-284705 2019-03-04T17:45:30Z
info:eu-repo/semantics/doctoralThesis info:ulb-repo/semantics/doctoralThesis info:ulb-repo/semantics/openurl/vlink-dissertation
Real-time Business Intelligence through Compact and Efficient Query Processing Under Updates
Idris, Muhammad; Vansummeren, Stijn; Lehner, Wolfgang W.L.; Zimanyi, Esteban; Sakr, Mahmoud; Schill, Alexander A.L.; Fletcher, George
Université libre de Bruxelles; Technische Universität Dresden, Faculty of Computer Science; Université libre de Bruxelles, Ecole polytechnique de Bruxelles – Informatique, Bruxelles
2019-03-05 en
3 full-text file(s): application/pdf | application/pdf | application/pdf
3 full-text file(s): info:eu-repo/semantics/openAccess | info:eu-repo/semantics/closedAccess | info:eu-repo/semantics/restrictedAccess