Do you know what a pipeline is?

A pipeline, as its name suggests, is a set of pipes carefully arranged so that liquids or gases can be carried from one place to another in a structured way. The concept has become a popular analogy in many fields to represent complex processes in which each step depends on its predecessors and is crucial to the ones downstream.

Instead of pipes, however, professionals use diagrams to represent a particular system. In these depictions, also known as workflows, the different steps are interconnected in such a way that one can comprehend and, most importantly, visualize the full system, keep track of its breakpoints, inputs, and outputs, or even propose modifications that benefit the whole operation.

In Bioinformatics, this concept is used extensively, since many tools are usually interconnected. The output files of one piece of software serve as input to others, so that the desired outcome is reached at the end. These file manipulations are performed by executable command-line tools written for Unix-compatible operating systems, allowing an end-to-end analysis: the pipeline takes charge of automatically producing the intermediate files needed to reach the final output from a given set of input files.
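To make that idea concrete, here is a minimal toy sketch in Python of a pipeline runner that chains steps and produces intermediate files automatically. The step functions and file names are invented stand-ins, not real bioinformatics tools; it just shows the mechanics of "one step's output feeds the next".

```python
from pathlib import Path

# A minimal sketch: each step turns one file into the next, and the pipeline
# only (re)creates an intermediate file if it is missing.
# The step functions are trivial stand-ins for real command-line tools.

def step_uppercase(infile: Path, outfile: Path) -> None:
    outfile.write_text(infile.read_text().upper())

def step_count_lines(infile: Path, outfile: Path) -> None:
    outfile.write_text(f"{len(infile.read_text().splitlines())} lines\n")

def run_pipeline(raw_input: Path, workdir: Path) -> Path:
    workdir.mkdir(exist_ok=True)
    steps = [
        (step_uppercase, workdir / "step1_upper.txt"),
        (step_count_lines, workdir / "step2_counts.txt"),
    ]
    current = raw_input
    for func, output in steps:
        if not output.exists():   # intermediate files are produced automatically
            func(current, output)
        current = output          # this step's output feeds the next step
    return current

if __name__ == "__main__":
    raw = Path("reads.txt")
    raw.write_text("ACGT\nTTGA\n")  # toy input standing in for sequencing data
    final = run_pipeline(raw, Path("intermediate"))
    print("Final output:", final.read_text().strip())
```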

[Figure: NGS variant calling pipeline workflow, from Kumaran et al., 2019 [1]]

As an example, let’s look at a common pipeline for variant analysis based on Next-Generation Sequencing (NGS) data. As illustrated in the picture above, the pipeline consists of pre-processing the raw reads, followed by mapping them to a reference genome. After an optional post-processing step, variant calling is performed on the file that represents the aligned reads.

As we can see, different tools produce output files that are consumed by downstream ones in a chained process, i.e. without the need to run each tool separately. As a consequence, users executing this pipeline can easily obtain their sample’s variants by providing only the files with the sequencing reads.
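As a rough sketch of how those steps could be chained, the Python script below strings together one possible tool choice per step (fastp for read pre-processing, bwa for mapping, samtools for sorting/indexing, bcftools for variant calling). The tool choices, flags, and file names are assumptions for illustration only; they are not necessarily the pipeline shown in the figure, and a real analysis would adapt them to the data at hand.

```python
import subprocess

# Illustrative only: one possible tool per step of an NGS variant-calling
# pipeline. Tool choices, flags, and file names are assumptions; adjust them
# to your own data and installed tool versions.
STEPS = [
    # 1. Pre-process raw reads (quality/adapter trimming)
    "fastp -i reads_1.fastq.gz -I reads_2.fastq.gz "
    "-o trimmed_1.fastq.gz -O trimmed_2.fastq.gz",
    # 2. Map reads to the reference genome
    "bwa index reference.fa",
    "bwa mem reference.fa trimmed_1.fastq.gz trimmed_2.fastq.gz > aligned.sam",
    # 3. (Optional) post-process the alignments: sort and index
    "samtools sort -o aligned.sorted.bam aligned.sam",
    "samtools index aligned.sorted.bam",
    # 4. Call variants from the aligned reads
    "bcftools mpileup -f reference.fa aligned.sorted.bam "
    "| bcftools call -mv -o variants.vcf",
]

for command in STEPS:
    print(f"[running] {command}")
    # shell=True lets us use pipes and redirects exactly as on the command line
    subprocess.run(command, shell=True, check=True)

print("Done: variants.vcf")
```

The point is that a user only provides the read files; everything in between, from the trimmed reads to the sorted alignments, is produced automatically on the way to the final VCF.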

However, it’s worth mentioning that even though pipelines make things simpler, that doesn’t mean things are always simple. For instance, a closed pipeline, in which tools and parameters are fixed and hardly configurable, is only sufficient for analyzing a specific kind of data. For other types of data, changes to the intermediate steps of the pipeline would be necessary.
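One common way to keep a pipeline open rather than closed is to move tool choices and parameters out of the code and into a configuration that users can edit. The sketch below is a hypothetical example of that idea; the config keys, tool choice, and file names are made up for illustration.

```python
# Hypothetical example: tool choices and parameters live in a config
# (in practice usually a YAML/JSON file), not in the pipeline code itself.
CONFIG = {
    "trimmer": {"tool": "fastp", "min_quality": 20},
}

def build_trim_command(cfg: dict) -> str:
    """Build the read pre-processing command from configuration."""
    if cfg["tool"] == "fastp":
        return (
            "fastp -i reads_1.fastq.gz -I reads_2.fastq.gz "
            "-o trimmed_1.fastq.gz -O trimmed_2.fastq.gz "
            f"-q {cfg['min_quality']}"  # quality threshold comes from the config
        )
    raise ValueError(f"Unsupported trimmer: {cfg['tool']}")

print(build_trim_command(CONFIG["trimmer"]))
```

Changing the threshold, or swapping the trimmer for another tool, then becomes a configuration change rather than a change to the pipeline itself.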

In general, to design a pipeline, developers must take into account the purpose of the analysis and the target users. These considerations help answer key questions that will drive the best pipeline design:

  • Does the pipeline need to be flexible?
  • What set of default tools and parameters are best suited for the desired output?
  • How should unexpected situations be handled?
  • How computationally intensive will the pipeline be?
  • Is a graphical user interface necessary?

These questions are related to three important concepts, as defined by Leipzig (2017) [2]:

Ease of development: refers to the effort required to compose workflows and also wrap new tools, such as custom or publicly available scripts and executables.
Ease of use: refers to the effort required to use existing pipelines to process new data, such as samples, and also the ease of sharing pipelines in a collaborative fashion.
Performance: refers to the efficiency of the framework in executing a pipeline, in terms of both parallelization and scalability.

What about now? Shall we start building our own pipes?

References:

[1] Kumaran M, Subramanian U, Devarajan B. Performance assessment of variant calling pipelines using human whole exome sequencing and simulated data. BMC Bioinformatics. 2019;20(1):342. Published 2019 Jun 17. doi:10.1186/s12859-019-2928-9

[2] Leipzig J. A review of bioinformatic pipeline frameworks. Brief Bioinform. 2017;18(3):530–536. doi:10.1093/bib/bbw020
