Scientific Workflow Languages

In our Introduction to the Command Line we introduced the idea of a “shell script”: a way to collect commands into a single block of reusable analysis. Such shell scripts are written in a shell language like Bash and can take parameters so that they act like custom-built commands. We also introduced the idea of variables and loops, including parameter variables like $1 and $2.

While shell scripts provide a first step into producing reproducible analyses, there are two significant issues that they don’t address:

  1. The software used for analysis needs to be installed. This is known as dependency management. We have seen how packaging systems like conda can help with dependency management and briefly mentioned the existence of the Docker and Singularity software container systems.

  2. Analyses can run in many environments (e.g. on a laptop, on a computing cluster and in cloud resources) and tasks can sometimes be split up and run in parallel.

Scientific Workflow Languages and Scientific Workflow Management Systems address these concerns. A scientific workflow language allows the steps of an analysis to be specified as modular units; a workflow management system takes such a specification, manages the installation of dependencies and executes the analysis on the available computing resources.

In this course we have already seen Galaxy, a web-based bioinformatics environment that contains a workflow management system. Galaxy workflows are built from blocks of analysis known as tools and edited in a graphical workflow editor. They can also be shared, either as downloadable files (in a JSON-based format) or via workflow publishing services like Dockstore or WorkflowHub.EU.

In addition to Galaxy, several command-line oriented workflow systems exist. Some commonly used examples are Nextflow, Snakemake, the Common Workflow Language (CWL) and the Workflow Description Language (WDL). Nextflow and Snakemake each combine a workflow language with an execution engine, while CWL and WDL are language specifications that are run by separate engines.

We will be focusing on Nextflow because it is widely used, has an active community of workflow publishers (e.g. the nf-core project) and has useful features such as dependency management and support for resuming workflows.

Our training was drawn from the Nextflow Basic Training materials. After you have installed Nextflow (see the notes on that here), start with the training Introduction.

Key points on Nextflow

Nextflow workflows are written in a domain-specific language built on Groovy and are composed of modular units called processes that are combined into a workflow using channels. Each process has input and output channels that consume and produce values such as file system paths and text strings. For more on channels read here and on processes read here. Finally, operators (such as mix() and collect()) are used to transform the contents of channels and thereby change how workflows execute. Read more about operators here.
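As a minimal sketch of how these pieces fit together (the process name and values here are illustrative, assuming Nextflow's current DSL2 syntax):

#!/usr/bin/env nextflow

// A process that consumes values from its input channel and
// produces one line of text per value on its output channel.
process SAY_HELLO {
    input:
    val name

    output:
    stdout

    script:
    """
    echo "Hello, ${name}!"
    """
}

workflow {
    // Channel.of() creates a channel emitting the given values;
    // the collect() operator gathers all outputs into a single list.
    names_ch = Channel.of('Alice', 'Bob')
    SAY_HELLO(names_ch)
    SAY_HELLO.out.collect().view()
}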

Each Nextflow workflow can accept parameters via the special-purpose params variable. These parameters can either be set on the command line or provided via a nextflow.config file that is typically in the directory that you run your workflow from. On the command line they are set using a double dash (--), e.g.

nextflow run -resume mapping.nf --reads 'data/reads/*_{R1,R2}.fastq.gz' --reference data/reference.fasta

In the above example -resume is a flag passed to the Nextflow runner, while --reads and --reference set the variables params.reads and params.reference respectively. (The file pattern is quoted so that Nextflow, rather than the shell, expands the wildcard.) If parameters are set in a nextflow.config file they are placed in a params block. For example, such a file could look like this:

params {
    reads = "$projectDir/data/gut_{1,2}.fq"
    reference = "$projectDir/data/reference.fasta"
}
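Parameters can also be given default values at the top of the workflow script itself; a value from nextflow.config or the command line then overrides the script default. A minimal sketch (the parameter names mirror the example above):

// main.nf: defaults used when --reads / --reference are not supplied
params.reads = "$projectDir/data/gut_{1,2}.fq"
params.reference = "$projectDir/data/reference.fasta"

workflow {
    println "Reads:     ${params.reads}"
    println "Reference: ${params.reference}"
}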

Dependencies in Nextflow can be managed using the Docker or Singularity container systems or using Conda. Container systems bundle software together into a container image that provides a filesystem with all the software needed to support an analysis. Information on installing Docker can be found here. Singularity differs from Docker in that container images are supplied as files and containers run as the currently logged-in user; on shared computing resources, many IT administrators therefore prefer Singularity over Docker for its security benefits. Instructions on installing Singularity are here on the Apptainer website (Apptainer is the open-source project that now develops the Singularity software). For Linux, installation from pre-built packages (described here) is recommended.
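For example, container use can be switched on in nextflow.config. In this sketch the image name is only a placeholder; substitute a container providing the tools your workflow needs:

// nextflow.config: run every process inside the same container image
process.container = 'quay.io/biocontainers/samtools:1.17--h00cdaf9_0'  // placeholder image
docker.enabled = true

// On shared clusters, enable Singularity instead:
// singularity.enabled = true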

The Bioconda project automatically builds Docker and Singularity containers (via the BioContainers project) for each Conda package it provides; there are currently more than 10,000 packages available this way. StaPH-B also provides some 150 containers for commonly used bioinformatics tools. (StaPH-B is the State Public Health Bioinformatics Group, an association of bioinformaticians working in public health laboratories in US states.)

Dependencies can either be associated with the Nextflow environment as a whole (and thus configured in nextflow.config) or specified for a particular process, as in the sketch below. Read more about how this is done in this section of the training.
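As a sketch of the per-process approach, a process can declare its own dependencies with the conda or container directive (the package and image names here are illustrative):

process FASTQC {
    // Either a Conda package...
    conda 'bioconda::fastqc=0.12.1'
    // ...or a container image:
    // container 'quay.io/biocontainers/fastqc:0.12.1--hdfd78af_0'

    input:
    path reads

    output:
    path "*_fastqc.*"

    script:
    """
    fastqc ${reads}
    """
}

Note that the chosen system still needs to be enabled, e.g. with docker.enabled, singularity.enabled or conda.enabled in nextflow.config.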

Nextflow communities

The nf-core community is an especially active community of Nextflow users and workflow developers. It provides online training and also runs a Slack discussion forum.