Requirement-driven Design and Optimization of Data-Intensive Flows

Bibliographic Details
Main Author: Jovanovic, Petar
Other Authors: Abello, Alberto
Format: Doctoral Thesis
Language: en
Published: Université libre de Bruxelles 2016
Subjects:
ETL
Online Access:http://hdl.handle.net/2013/ULB-DIPOT:oai:dipot.ulb.ac.be:2013/237500
id ndltd-ulb.ac.be-oai-dipot.ulb.ac.be-2013-237500
record_format oai_dc
collection NDLTD
language en
format Doctoral Thesis
sources NDLTD
topic Analyse de systèmes informatiques
Informatique générale
data-intensive flows
workflow management
optimization
business intelligence
ETL
Data Warehousing
spellingShingle Analyse de systèmes informatiques
Informatique générale
data-intensive flows
workflow management
optimization
business intelligence
ETL
Data Warehousing
Jovanovic, Petar
Requirement-driven Design and Optimization of Data-Intensive Flows
description Data have become the number one asset of today's business world, and their exploitation and analysis have attracted the attention of people from different fields and with different technical backgrounds. Data-intensive flows are central processes in today's business intelligence (BI) systems, deploying different technologies to deliver data, from a multitude of data sources, in user-preferred and analysis-ready formats. However, designing and optimizing such data flows, to satisfy both users' information needs and agreed quality standards, is known to be a burdensome task, typically left to the manual efforts of a BI system designer. These tasks have become even more challenging for next-generation BI systems, where data flows typically need to combine data from in-house transactional data stores with data coming from external sources in a variety of formats (e.g., social media, governmental data, news feeds). Moreover, to make an impact on business outcomes, data flows are expected to answer unanticipated analytical needs of a broader set of business users and to deliver valuable information in near real-time (i.e., at the right time). These challenges clearly indicate a need to boost the automation of the design and optimization of data-intensive flows. This PhD thesis aims at providing automatable means for managing the lifecycle of data-intensive flows. The study first analyzes the remaining challenges to be solved in the field of data-intensive flows, by surveying the current literature, and envisions an architecture for managing the lifecycle of data-intensive flows. Following the proposed architecture, we then focus on providing automatic techniques covering the different phases of the data-intensive flows' lifecycle. In particular, the thesis first proposes an approach (CoAl) for the incremental design of data-intensive flows by means of multi-flow consolidation. CoAl not only facilitates the maintenance of data flow designs in the face of changing information needs, but also supports the multi-flow optimization of data-intensive flows by maximizing their reuse. Next, in the data warehousing (DW) context, we propose a complementary method (ORE) for the incremental design of the target DW schema, along with systematic tracing of evolution metadata, which can further facilitate the design of back-end data-intensive flows (i.e., ETL processes). The thesis then studies the problem of implementing data-intensive flows in the deployable formats of different execution engines, and proposes the BabbleFlow system for translating logical data-intensive flows into executable formats spanning single or multiple execution engines. Lastly, the thesis focuses on managing the execution of data-intensive flows on distributed data processing platforms and, to this end, proposes an algorithm (H-WorD) that supports the scheduling of data-intensive flows through workload-driven redistribution of data in computing clusters. The overall outcome of this thesis is an end-to-end platform for managing the lifecycle of data-intensive flows, called Quarry. The techniques proposed in this thesis, plugged into the Quarry platform, largely reduce manual effort and assist users with different technical skills in their analytical tasks. Finally, the results of this thesis contribute to the field of data-intensive flows in today's BI systems and advocate further attention by both academia and industry to the problems of designing and optimizing data-intensive flows.
=== Doctorat en Sciences de l'ingénieur et technologie === info:eu-repo/semantics/nonPublished
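
As a concrete illustration of the multi-flow consolidation idea behind CoAl, the sketch below (in Python) merges a newly requested flow with an existing one by reusing their longest shared prefix of operations. The flow model, the operation encoding, and the prefix-sharing heuristic are assumptions made for this example only; the thesis' CoAl approach works on general flow graphs with a cost model, not on the simplified linear flows shown here.

# Illustrative sketch only: consolidating a new data flow with an existing one
# by reusing their longest shared prefix of operations, in the spirit of the
# multi-flow consolidation (CoAl) idea summarized in the abstract. The flow
# model and operations below are assumptions for this example, not the
# thesis algorithm.
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    kind: str      # e.g. "extract", "filter", "join", "aggregate"
    params: tuple  # operation parameters (hashable, so Ops can be compared)


@dataclass
class Flow:
    name: str
    ops: list  # ordered list of Op (a linear flow, for simplicity)


def consolidate(existing: Flow, new: Flow) -> dict:
    """Merge two flows so that their common prefix is designed and executed
    only once; each flow keeps its non-shared suffix as a separate branch."""
    shared = 0
    for a, b in zip(existing.ops, new.ops):
        if a != b:
            break
        shared += 1
    return {
        "shared": existing.ops[:shared],
        "branches": {
            existing.name: existing.ops[shared:],
            new.name: new.ops[shared:],
        },
    }


if __name__ == "__main__":
    sales = Flow("sales_report", [
        Op("extract", ("sales_db",)),
        Op("filter", ("year = 2016",)),
        Op("aggregate", ("sum(amount) by region",)),
    ])
    margins = Flow("margin_report", [
        Op("extract", ("sales_db",)),
        Op("filter", ("year = 2016",)),
        Op("join", ("costs_db", "product_id")),
        Op("aggregate", ("sum(amount - cost) by region",)),
    ])
    merged = consolidate(sales, margins)
    print(len(merged["shared"]))                                     # 2 operations reused
    print([op.kind for op in merged["branches"]["margin_report"]])   # ['join', 'aggregate']

Running the example reports that the extraction and filter operations are shared between the two flows, so only the join and the new aggregation would have to be added to the consolidated design, which is the reuse-maximizing behavior the abstract attributes to CoAl.
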
author2 Abello, Alberto
author_facet Abello, Alberto
Jovanovic, Petar
author Jovanovic, Petar
author_sort Jovanovic, Petar
title Requirement-driven Design and Optimization of Data-Intensive Flows
title_short Requirement-driven Design and Optimization of Data-Intensive Flows
title_full Requirement-driven Design and Optimization of Data-Intensive Flows
title_fullStr Requirement-driven Design and Optimization of Data-Intensive Flows
title_full_unstemmed Requirement-driven Design and Optimization of Data-Intensive Flows
title_sort requirement-driven design and optimization of data-intensive flows
publisher Université libre de Bruxelles
publishDate 2016
url http://hdl.handle.net/2013/ULB-DIPOT:oai:dipot.ulb.ac.be:2013/237500
work_keys_str_mv AT jovanovicpetar requirementdrivendesignandoptimizationofdataintensiveflows
_version_ 1718630975056904192
spelling ndltd-ulb.ac.be-oai-dipot.ulb.ac.be-2013-237500 2018-04-11T17:38:13Z info:eu-repo/semantics/doctoralThesis info:ulb-repo/semantics/doctoralThesis info:ulb-repo/semantics/openurl/vlink-dissertation Requirement-driven Design and Optimization of Data-Intensive Flows Jovanovic, Petar Abello, Alberto Calders, Toon Romero, Oscar Zimanyi, Esteban Vassiliadis, Panos Lehner, Wolfgang Urpí, Toni Vansummeren, Stijn Université libre de Bruxelles Universitat Politècnica de Catalunya, BarcelonaTech, Department of Service and Information System Engineering Université libre de Bruxelles, Ecole polytechnique de Bruxelles – Informatique, Bruxelles 2016-09-26 en 245 p. Doctorat en Sciences de l'ingénieur et technologie info:eu-repo/semantics/nonPublished http://hdl.handle.net/2013/ULB-DIPOT:oai:dipot.ulb.ac.be:2013/237500 No full-text files