A lightweight, flow-based toolkit for parallel and distributed bioinformatics pipelines

Abstract. Background: Bioinformatic analyses typically proceed as chains of data-processing tasks. A pipeline, or 'workflow', is a well-defined protocol, with a specific structure defined by the topology of data-flow interdependencies, and a par...

Bibliographic Details
Main Authors:	Cieślik Marcin, Mura Cameron
Format:	Article
Language:	English
Published:	BMC 2011-02-01
Series:	BMC Bioinformatics
Online Access:	http://www.biomedcentral.com/1471-2105/12/61

id	doaj-380ec38d3e79408b822a199499ba8c1b
record_format	Article
spelling	doaj-380ec38d3e79408b822a199499ba8c1b 2020-11-24T23:55:18Z eng BMC BMC Bioinformatics 1471-2105 2011-02-01 12 1 61 10.1186/1471-2105-12-61 A lightweight, flow-based toolkit for parallel and distributed bioinformatics pipelines Cieślik Marcin Mura Cameron http://www.biomedcentral.com/1471-2105/12/61
collection	DOAJ
language	English
format	Article
sources	DOAJ
author	Cieślik Marcin Mura Cameron
spellingShingle	Cieślik Marcin Mura Cameron A lightweight, flow-based toolkit for parallel and distributed bioinformatics pipelines BMC Bioinformatics
author_facet	Cieślik Marcin Mura Cameron
author_sort	Cieślik Marcin
title	A lightweight, flow-based toolkit for parallel and distributed bioinformatics pipelines
title_short	A lightweight, flow-based toolkit for parallel and distributed bioinformatics pipelines
title_full	A lightweight, flow-based toolkit for parallel and distributed bioinformatics pipelines
title_fullStr	A lightweight, flow-based toolkit for parallel and distributed bioinformatics pipelines
title_full_unstemmed	A lightweight, flow-based toolkit for parallel and distributed bioinformatics pipelines
title_sort	lightweight, flow-based toolkit for parallel and distributed bioinformatics pipelines
publisher	BMC
series	BMC Bioinformatics
issn	1471-2105
publishDate	2011-02-01
description	Abstract. Background: Bioinformatic analyses typically proceed as chains of data-processing tasks. A pipeline, or 'workflow', is a well-defined protocol, with a specific structure defined by the topology of data-flow interdependencies, and a particular functionality arising from the data transformations applied at each step. In computer science, the dataflow programming (DFP) paradigm defines software systems constructed in this manner, as networks of message-passing components. Thus, bioinformatic workflows can be naturally mapped onto DFP concepts. Results: To enable the flexible creation and execution of bioinformatics dataflows, we have written a modular framework for parallel pipelines in Python ('PaPy'). A PaPy workflow is created from re-usable components connected by data-pipes into a directed acyclic graph, which together define nested higher-order map functions. The successive functional transformations of input data are evaluated on flexibly pooled compute resources, either local or remote. Input items are processed in batches of adjustable size, allowing one to tune the trade-off between parallelism and lazy-evaluation (memory consumption). An add-on module ('NuBio') facilitates the creation of bioinformatics workflows by providing domain-specific data-containers (e.g., for biomolecular sequences, alignments, structures) and functionality (e.g., to parse/write standard file formats). Conclusions: PaPy offers a modular framework for the creation and deployment of parallel and distributed data-processing workflows. Pipelines derive their functionality from user-written, data-coupled components, so PaPy can also be viewed as a lightweight toolkit for extensible, flow-based bioinformatics data-processing. The simplicity and flexibility of distributed PaPy pipelines may help users bridge the gap between traditional desktop/workstation and grid computing. PaPy is freely distributed as open-source Python code at http://muralab.org/PaPy, and includes extensive documentation and annotated usage examples.
url	http://www.biomedcentral.com/1471-2105/12/61
work_keys_str_mv	AT cieslikmarcin alightweightflowbasedtoolkitforparallelanddistributedbioinformaticspipelines AT muracameron alightweightflowbasedtoolkitforparallelanddistributedbioinformaticspipelines AT cieslikmarcin lightweightflowbasedtoolkitforparallelanddistributedbioinformaticspipelines AT muracameron lightweightflowbasedtoolkitforparallelanddistributedbioinformaticspipelines
_version_	1725463132544958464
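
Illustrative note: the abstract describes pipelines as nested map functions whose input is processed in batches of adjustable size, trading parallelism against lazy evaluation. The sketch below shows that general idea using only the Python standard library. It is not PaPy's API; the names `batched_map` and `pipeline` are hypothetical and were chosen for this example, and a thread pool stands in for the pooled local or remote compute resources mentioned in the abstract.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def batched_map(func, items, batch_size, executor):
    """Lazily apply func to items, evaluating batch_size items at a time.

    Hypothetical helper for illustration only (not part of PaPy).
    A larger batch_size exposes more parallel work per step; a smaller
    one keeps fewer items in memory at once.
    """
    iterator = iter(items)
    while True:
        # Pull at most batch_size items from upstream; nothing beyond
        # this batch is read until the consumer asks for more.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        # executor.map preserves input order; list() waits for the batch.
        yield from list(executor.map(func, batch))


def pipeline(items, stages, batch_size, executor):
    """Chain stages so each batched map consumes the previous one's output."""
    stream = items
    for stage in stages:
        stream = batched_map(stage, stream, batch_size, executor)
    return stream


if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=4) as pool:
        stages = [lambda x: x + 1, lambda x: x * 2]
        results = list(pipeline(range(6), stages, batch_size=2, executor=pool))
    print(results)  # [2, 4, 6, 8, 10, 12]
```

With `batch_size=1` the sketch is fully lazy but effectively serial; raising it lets up to `batch_size` items run concurrently in each stage at the cost of holding that many intermediate results in memory.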

Similar Items