Requirement-driven Design and Optimization of Data-Intensive Flows

Bibliographic Details
Main Author: Jovanovic, Petar
Other Authors: Abello, Alberto
Format: Doctoral Thesis
Language: en
Published: Université libre de Bruxelles 2016
Subjects:
ETL
Online Access:http://hdl.handle.net/2013/ULB-DIPOT:oai:dipot.ulb.ac.be:2013/237500
id ndltd-ulb.ac.be-oai-dipot.ulb.ac.be-2013-237500
record_format oai_dc
collection NDLTD
language en
format Doctoral Thesis
sources NDLTD
topic Analyse de systèmes informatiques
Informatique générale
data-intensive flows
workflow management
optimization
business intelligence
ETL
Data Warehousing
spellingShingle Analyse de systèmes informatiques
Informatique générale
data-intensive flows
workflow management
optimization
business intelligence
ETL
Data Warehousing
Jovanovic, Petar
Requirement-driven Design and Optimization of Data-Intensive Flows
description Data have become the number one asset of today's business world, and their exploitation and analysis have attracted the attention of people from different fields and with different technical backgrounds. Data-intensive flows are central processes in today's business intelligence (BI) systems, deploying different technologies to deliver data, from a multitude of data sources, in user-preferred and analysis-ready formats. However, designing and optimizing such data flows, to satisfy both users' information needs and agreed quality standards, is known to be a burdensome task, typically left to the manual efforts of a BI system designer. These tasks have become even more challenging for next-generation BI systems, where data flows typically need to combine data from in-house transactional data stores with data coming from external sources in a variety of formats (e.g., social media, governmental data, news feeds). Moreover, to make an impact on business outcomes, data flows are expected to answer unanticipated analytical needs of a broader set of business users and to deliver valuable information in near real-time (i.e., at the right time). These challenges clearly indicate a need to boost the automation of the design and optimization of data-intensive flows. This PhD thesis aims at providing automatable means for managing the lifecycle of data-intensive flows. The study first analyzes the remaining challenges to be solved in the field of data-intensive flows, by surveying the current literature, and envisions an architecture for managing the lifecycle of data-intensive flows. Following the proposed architecture, we then focus on providing automatic techniques covering the different phases of the data-intensive flows' lifecycle. In particular, the thesis first proposes an approach (CoAl) for the incremental design of data-intensive flows by means of multi-flow consolidation. CoAl not only facilitates the maintenance of data flow designs in the face of changing information needs, but also supports the multi-flow optimization of data-intensive flows by maximizing their reuse. Next, in the data warehousing (DW) context, we propose a complementary method (ORE) for the incremental design of the target DW schema, along with systematic tracing of evolution metadata, which can further facilitate the design of back-end data-intensive flows (i.e., ETL processes). The thesis then studies the problem of implementing data-intensive flows in the deployable formats of different execution engines, and proposes the BabbleFlow system for translating logical data-intensive flows into executable formats spanning single or multiple execution engines. Lastly, the thesis focuses on managing the execution of data-intensive flows on distributed data processing platforms and, to this end, proposes an algorithm (H-WorD) that supports the scheduling of data-intensive flows through workload-driven redistribution of data in computing clusters. The overall outcome of this thesis is an end-to-end platform for managing the lifecycle of data-intensive flows, called Quarry. The techniques proposed in this thesis, plugged into the Quarry platform, largely reduce manual effort and assist users with different technical skills in their analytical tasks. Finally, the results of this thesis contribute to the field of data-intensive flows in today's BI systems and advocate further attention by both academia and industry to the problems of designing and optimizing data-intensive flows.
=== Doctorat en Sciences de l'ingénieur et technologie === info:eu-repo/semantics/nonPublished
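
As a concrete illustration of the multi-flow consolidation idea behind CoAl, the sketch below (in Python) merges a newly requested flow with an existing one by reusing their longest shared prefix of operations. The flow model, the operation encoding, and the prefix-sharing heuristic are assumptions made for this example only; the thesis' CoAl approach works on general flow graphs with a cost model, not on the simplified linear flows shown here.

# Illustrative sketch only: consolidating a new data flow with an existing one
# by reusing their longest shared prefix of operations, in the spirit of the
# multi-flow consolidation (CoAl) idea summarized in the abstract. The flow
# model and operations below are assumptions for this example, not the
# thesis algorithm.
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    kind: str      # e.g. "extract", "filter", "join", "aggregate"
    params: tuple  # operation parameters (hashable, so Ops can be compared)


@dataclass
class Flow:
    name: str
    ops: list  # ordered list of Op (a linear flow, for simplicity)


def consolidate(existing: Flow, new: Flow) -> dict:
    """Merge two flows so that their common prefix is designed and executed
    only once; each flow keeps its non-shared suffix as a separate branch."""
    shared = 0
    for a, b in zip(existing.ops, new.ops):
        if a != b:
            break
        shared += 1
    return {
        "shared": existing.ops[:shared],
        "branches": {
            existing.name: existing.ops[shared:],
            new.name: new.ops[shared:],
        },
    }


if __name__ == "__main__":
    sales = Flow("sales_report", [
        Op("extract", ("sales_db",)),
        Op("filter", ("year = 2016",)),
        Op("aggregate", ("sum(amount) by region",)),
    ])
    margins = Flow("margin_report", [
        Op("extract", ("sales_db",)),
        Op("filter", ("year = 2016",)),
        Op("join", ("costs_db", "product_id")),
        Op("aggregate", ("sum(amount - cost) by region",)),
    ])
    merged = consolidate(sales, margins)
    print(len(merged["shared"]))                                     # 2 operations reused
    print([op.kind for op in merged["branches"]["margin_report"]])   # ['join', 'aggregate']

Running the example reports that the extraction and filter operations are shared between the two flows, so only the join and the new aggregation would have to be added to the consolidated design, which is the reuse-maximizing behavior the abstract attributes to CoAl.
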
author2 Abello, Alberto
author_facet Abello, Alberto
Jovanovic, Petar
author Jovanovic, Petar
author_sort Jovanovic, Petar
title Requirement-driven Design and Optimization of Data-Intensive Flows
title_short Requirement-driven Design and Optimization of Data-Intensive Flows
title_full Requirement-driven Design and Optimization of Data-Intensive Flows
title_fullStr Requirement-driven Design and Optimization of Data-Intensive Flows
title_full_unstemmed Requirement-driven Design and Optimization of Data-Intensive Flows
title_sort requirement-driven design and optimization of data-intensive flows
publisher Université libre de Bruxelles
publishDate 2016
url http://hdl.handle.net/2013/ULB-DIPOT:oai:dipot.ulb.ac.be:2013/237500
work_keys_str_mv AT jovanovicpetar requirementdrivendesignandoptimizationofdataintensiveflows
_version_ 1718630975056904192
spelling ndltd-ulb.ac.be-oai-dipot.ulb.ac.be-2013-237500 2018-04-11T17:38:13Z info:eu-repo/semantics/doctoralThesis info:ulb-repo/semantics/doctoralThesis info:ulb-repo/semantics/openurl/vlink-dissertation Requirement-driven Design and Optimization of Data-Intensive Flows Jovanovic, Petar Abello, Alberto Calders, Toon Romero, Oscar Zimanyi, Esteban Vassiliadis, Panos Lehner, Wolfgang Urpí, Toni Vansummeren, Stijn Université libre de Bruxelles Universitat Politècnica de Catalunya, BarcelonaTech, Department of Service and Information System Engineering Université libre de Bruxelles, Ecole polytechnique de Bruxelles – Informatique, Bruxelles 2016-09-26 en 245 p. Doctorat en Sciences de l'ingénieur et technologie info:eu-repo/semantics/nonPublished http://hdl.handle.net/2013/ULB-DIPOT:oai:dipot.ulb.ac.be:2013/237500 No full-text files