Parallelizing user-defined functions in the ETL workflow using orchestration style sheets

Today’s ETL tools provide capabilities to develop custom code as user-defined functions (UDFs) to extend the expressiveness of the standard ETL operators. However, while this allows us to easily add new functionalities, it also comes with the risk that the custom code is not intended to be optimized...

Bibliographic Details
Main Authors: Ali Syed Muhammad Fawad, Mey Johannes, Thiele Maik
Format: Article
Language: English
Published: Sciendo 2019-03-01
Series: International Journal of Applied Mathematics and Computer Science
Subjects: ETL workflow; parallel ETL operators; parallel algorithmic skeletons; user-defined functions
Online Access: https://doi.org/10.2478/amcs-2019-0005
id doaj-26067912ad9c47aabd386140ea3846f0
record_format Article
spelling doaj-26067912ad9c47aabd386140ea3846f0
2021-09-06T19:41:09Z
eng
Sciendo
International Journal of Applied Mathematics and Computer Science
ISSN 2083-8492
2019-03-01
Volume 29, Issue 1, pp. 69-79
DOI 10.2478/amcs-2019-0005
Author affiliations:
Ali Syed Muhammad Fawad: Faculty of Computing, Poznań University of Technology, Piotrowo 2, 60-965 Poznań, Poland
Mey Johannes: Faculty of Computer Science, Technical University of Dresden, Helmholtzstrasse 10, 01069 Dresden, Germany
Thiele Maik: Faculty of Computer Science, Technical University of Dresden, Helmholtzstrasse 10, 01069 Dresden, Germany
collection DOAJ
language English
format Article
sources DOAJ
author Ali Syed Muhammad Fawad
Mey Johannes
Thiele Maik
title Parallelizing user-defined functions in the ETL workflow using orchestration style sheets
publisher Sciendo
series International Journal of Applied Mathematics and Computer Science
issn 2083-8492
publishDate 2019-03-01
description Today’s ETL tools provide capabilities to develop custom code as user-defined functions (UDFs) to extend the expressiveness of the standard ETL operators. However, while this allows us to easily add new functionalities, it also comes with the risk that the custom code is not intended to be optimized, e.g., by parallelism, and for this reason, it performs poorly for data-intensive ETL workflows. In this paper we present a novel framework, which allows the ETL developer to choose a design pattern in order to write parallelizable code and generates a configuration for the UDFs to be executed in a distributed environment. This enables ETL developers with minimum expertise in distributed and parallel computing to develop UDFs without taking care of parallelization configurations and complexities. We perform experiments on large-scale datasets based on TPC-DS and BigBench. The results show that our approach significantly reduces the effort of ETL developers and at the same time generates efficient parallel configurations to support complex and data-intensive ETL tasks.
topic ETL workflow
parallel ETL operators
parallel algorithmic skeletons
user-defined functions
url https://doi.org/10.2478/amcs-2019-0005