Robust Cross-Platform Workflows: How Technical and Scientific Communities Collaborate to Develop, Test and Share Best Practices for Data Analysis


Bibliographic Details
Main Authors: Steffen Möller, Stuart W. Prescott, Lars Wirzenius, Petter Reinholdtsen, Brad Chapman, Pjotr Prins, Stian Soiland-Reyes, Fabian Klötzl, Andrea Bagnacani, Matúš Kalaš, Andreas Tille, Michael R. Crusoe
Format: Article
Language: English
Published: SpringerOpen 2017-11-01
Series: Data Science and Engineering
Online Access: http://link.springer.com/article/10.1007/s41019-017-0050-4
id doaj-dc6f615420f949e497aea96eb2551a47
record_format Article
spelling doaj-dc6f615420f949e497aea96eb2551a47 (2021-03-02T09:08:14Z, eng)
SpringerOpen, Data Science and Engineering, ISSN 2364-1185, 2364-1541
2017-11-01, volume 2, issue 3, pages 232-244, doi 10.1007/s41019-017-0050-4
Steffen Möller (Rostock University Medical Center, Institute for Biostatistics and Informatics in Medicine and Ageing Research)
Stuart W. Prescott (Debian Project)
Lars Wirzenius (Debian Project)
Petter Reinholdtsen (Debian Project)
Brad Chapman (Harvard School of Public Health)
Pjotr Prins (University Medical Center Utrecht)
Stian Soiland-Reyes (eScience Lab, School of Computer Science, The University of Manchester)
Fabian Klötzl (Max-Planck-Institute for Evolutionary Biology)
Andrea Bagnacani (Department of Systems Biology and Bioinformatics, University of Rostock)
Matúš Kalaš (Computational Biology Unit, Department of Informatics, University of Bergen)
Andreas Tille (Debian Project)
Michael R. Crusoe (Debian Project)
collection DOAJ
language English
format Article
sources DOAJ
author Steffen Möller
Stuart W. Prescott
Lars Wirzenius
Petter Reinholdtsen
Brad Chapman
Pjotr Prins
Stian Soiland-Reyes
Fabian Klötzl
Andrea Bagnacani
Matúš Kalaš
Andreas Tille
Michael R. Crusoe
title Robust Cross-Platform Workflows: How Technical and Scientific Communities Collaborate to Develop, Test and Share Best Practices for Data Analysis
publisher SpringerOpen
series Data Science and Engineering
issn 2364-1185
2364-1541
publishDate 2017-11-01
description Abstract: Information integration and workflow technologies for data analysis have always been major fields of investigation in bioinformatics. A range of popular workflow suites are available to support analyses in computational biology. Commercial providers tend to offer prepared applications remote to their clients. However, for most academic environments with local expertise, novel data collection techniques or novel data analysis, it is essential to have all the flexibility of open-source tools and open-source workflow descriptions. Workflows in data-driven science such as computational biology have considerably gained in complexity. New tools or new releases with additional features arrive at an enormous pace, and new reference data or concepts for quality control are emerging. A well-abstracted workflow and the exchange of the same across work groups have an enormous impact on the efficiency of research and the further development of the field. High-throughput sequencing adds to the avalanche of data available in the field; efficient computation and, in particular, parallel execution motivate the transition from traditional scripts and Makefiles to workflows. We here review the extant software development and distribution model with a focus on the role of integration testing and discuss the effect of common workflow language on distributions of open-source scientific software to swiftly and reliably provide the tools demanded for the execution of such formally described workflows. It is contended that, alleviated from technical differences for the execution on local machines, clusters or the cloud, communities also gain the technical means to test workflow-driven interaction across several software packages.
topic Continuous integration testing
Common workflow language
Container
Software distribution
Automated installation
url http://link.springer.com/article/10.1007/s41019-017-0050-4