An Event-Driven Serverless ETL Pipeline on AWS

This work presents an event-driven serverless Extract, Transform, and Load (ETL) pipeline architecture and evaluates its performance over a range of dataflow tasks of varying frequency, velocity, and payload size. We design an experiment using generated tabular data across...

Bibliographic Details
Main Authors: Antreas Pogiatzis, Georgios Samakovitis
Format: Article
Language: English
Published: MDPI AG 2021-12-01
Series: Applied Sciences
Subjects:
AWS
ETL
Online Access: https://www.mdpi.com/2076-3417/11/1/191
id doaj-3691642f30494958b4274cb187230467
record_format Article
spelling doaj-3691642f30494958b4274cb187230467; 2020-12-29T00:00:28Z; eng; MDPI AG; Applied Sciences; 2076-3417; 2021-12-01; 11; 1; 191; 191; 10.3390/app11010191; An Event-Driven Serverless ETL Pipeline on AWS; Antreas Pogiatzis [0]; Georgios Samakovitis [1]; [0] School of Computing and Mathematical Sciences, University of Greenwich, Old Royal Naval College, Park Row, Greenwich, London SE10 9LS, UK; [1] School of Computing and Mathematical Sciences, University of Greenwich, Old Royal Naval College, Park Row, Greenwich, London SE10 9LS, UK; https://www.mdpi.com/2076-3417/11/1/191; serverless; FaaS; event-driven; distributed; AWS; ETL
collection DOAJ
language English
format Article
sources DOAJ
author Antreas Pogiatzis
Georgios Samakovitis
title An Event-Driven Serverless ETL Pipeline on AWS
publisher MDPI AG
series Applied Sciences
issn 2076-3417
publishDate 2021-12-01
description This work presents an event-driven serverless Extract, Transform, and Load (ETL) pipeline architecture and evaluates its performance over a range of dataflow tasks of varying frequency, velocity, and payload size. We design an experiment using generated tabular data across varying data volumes, event frequencies, and processing power in order to measure: (i) the consistency of pipeline executions; (ii) the reliability of data delivery; (iii) the maximum payload size per pipeline; and (iv) economic scalability (cost of chargeable tasks). We run 92 parameterised experiments on a simple AWS architecture, avoiding any AWS-enhanced platform features, to allow for an unbiased assessment of our model’s performance. Our results indicate that our reference architecture can achieve time-consistent data processing of event payloads of more than 100 MB, with a throughput of 750 KB/s across four event frequencies. We also observe that, although using an SQS queue for data transfer enables easy concurrency control and data slicing, it becomes a bottleneck for large event payloads. Finally, we develop and discuss a candidate pricing model for the usage of our reference architecture.
topic serverless
FaaS
event-driven
distributed
AWS
ETL
url https://www.mdpi.com/2076-3417/11/1/191