A typical bioinformatics workflow involves several tools, each of which may depend on other tools and software libraries (known as dependencies). Modern operating systems come with package managers (e.g. apt on Ubuntu Linux, dnf on Fedora Linux and Microsoft Store on Windows) but often they don’t include packages for bioinformatics software.
The conda package management system has emerged to make it easier to install software packages, including bioinformatics packages. Conda was originally developed by Anaconda, Inc. for installing Python packages, but it has now been adopted by several different communities.
The original Anaconda package distribution contains a complete toolkit for data science with Python. As the conda package manager became a popular tool for installing packages, a smaller distribution called Miniconda was developed, containing only conda and the supporting software it needs.
Conda installs software by comparing the list of software the user already has with the list of dependencies needed for the new software to be installed. This process is called solving dependencies and, with the original conda tool, can be very time consuming. To address the problem of long install times, QuantStack developed mamba, which speeds up the installation of conda packages.
Both conda and mamba install software from the same collections of packages. These collections are called channels. The Anaconda developers produced the original channel, called defaults, and since then two important channels have been produced by developer communities:
conda-forge is a collection of general purpose and scientific software
bioconda is a collection of software for bioinformatics
The conda-forge community also created a distribution called Mambaforge that comes with mamba pre-installed. This makes it a good starting point for software installation.
All computers run an operating system (OS), for example Linux, Windows or macOS. When software is packaged, it is built for a specific operating system. In addition, a computer has a Central Processing Unit (CPU) that executes the machine language instructions that make a computer work. Each CPU has its own architecture, and only programs built for that architecture will run on it. Most laptop and server computers in use at present use the x86_64 architecture (also known as amd64). Recent Mac computers, however, use Apple silicon chips such as the M1, which implement the ARM64 architecture.
Software package managers can only install software if it is built for the correct operating system and architecture.
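To see what your own machine reports, the standard uname command prints the operating system name and the CPU architecture. The sketch below turns that output into an installer file name. The helper function installer_name is hypothetical, and the mapping from Darwin (the name macOS reports) to MacOSX is an assumption based on the file names used in this section.

```shell
# Hypothetical helper: build a Mambaforge installer file name from an
# OS name and a CPU architecture, as reported by `uname -s` and `uname -m`.
installer_name() {
    os_name="$1"
    cpu_arch="$2"
    # Assumption: macOS reports "Darwin", while the installers used in
    # this section are named "MacOSX".
    if [ "$os_name" = "Darwin" ]; then
        os_name="MacOSX"
    fi
    echo "Mambaforge-${os_name}-${cpu_arch}.sh"
}

echo "Operating system: $(uname -s)"
echo "Architecture:     $(uname -m)"
echo "Installer name:   $(installer_name "$(uname -s)" "$(uname -m)")"
```

If the name printed does not match one of the installers offered for download, the software has probably not been built for your combination of operating system and architecture.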
To get started, you need to download the correct installer for the package management system. Since we will be using Mambaforge, download the correct release for your operating system.
For Linux on x86_64 / amd64, download using:
wget -c https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-Linux-x86_64.sh
(the -c command-line flag tells wget to resume the download if it was interrupted)
For macOS on x86_64 / amd64, download using:
curl -O -J -L -C - https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-MacOSX-x86_64.sh
Then run the downloaded installer using bash:
For Linux:
bash Mambaforge-Linux-x86_64.sh
For macOS:
bash Mambaforge-MacOSX-x86_64.sh
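An interrupted download can leave the installer missing or empty, so it can help to check the file before running it. The following is a minimal sketch; the function name run_installer is hypothetical, and the check only confirms that the file exists and is not empty, not that it is complete.

```shell
# Hypothetical helper: run an installer script only if the file exists
# and is not empty (this does not prove the download is complete).
run_installer() {
    installer="$1"
    if [ -s "$installer" ]; then
        bash "$installer"
    else
        echo "Installer $installer is missing or empty; download it again." >&2
        return 1
    fi
}

# Usage (Linux file name from above):
#   run_installer Mambaforge-Linux-x86_64.sh
```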
The installer will ask you to agree to a License Agreement. Press the space bar to page through the license agreement, then type yes and press Enter to accept it. The installation then proceeds. Finally, you are asked whether you want to run conda init. Again type yes and press Enter. The installer changes your login settings (in .bashrc), so after installation you should log out and then log back in again.
A log of what you can expect to see during the install can be seen here.
Once you have conda installed, you should see (base) added to your prompt, e.g. (base) user1@head:~$. This means that conda has been added to your shell login and is ready for use. As mentioned above, software developers make software available in channels, and these need to be added to your conda configuration. Run these commands (taken from the bioconda instructions):
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
conda config --set channel_priority strict
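This configures conda to search for packages in the conda-forge and bioconda channels in addition to the defaults channel provided by Anaconda. The strict channel priority setting means that packages are taken from the highest-priority channel that provides them. You are now ready to install software.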
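To confirm the result, you can ask conda to print its channel settings. Because --add places each new channel at the top of the list, the channel added last (conda-forge) should be listed first and so has the highest priority. The sketch below wraps the check in a hypothetical function, show_channel_config, and prints a hint instead of failing when conda is not on your PATH.

```shell
# Hypothetical helper: print the configured channels and channel priority,
# or a hint if the conda command cannot be found.
show_channel_config() {
    if command -v conda >/dev/null 2>&1; then
        conda config --show channels
        conda config --show channel_priority
    else
        echo "conda not found on PATH; log out and back in, then try again"
    fi
}

show_channel_config
```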
One of the major advantages of conda as a package manager is that it makes it easy to have multiple environments. An environment is a collection of software packages. Because not all software works well together (e.g. one package might use Python 2.7, another might use Python 3), best practice is to make a new environment for each task that you want to perform and only install related software packages in it. When you start conda you are in the base environment (this is what the (base) in your prompt means). In general, do not install software into the base environment!
For our practice, we want to install some quality control and trimming packages. While we could (and perhaps should) make a different environment for each software package, for ease of use we will create a single environment called qc using the command:
mamba create -n qc -y fastqc multiqc flexbar trimmomatic
Note that we are using mamba because it installs packages faster than the conda command (the two commands might merge their features in the future, though). The -y flag tells mamba to install packages without asking you for further confirmation, and the -n qc flag sets the name of the environment being created.
Once the installation of packages (which might take some minutes) is complete, you can activate the environment with:
conda activate qc
This will change your prompt to something like (qc) user1@head:~$. You can switch back to the base environment by using conda deactivate.
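While the qc environment is active, a quick way to confirm that the tools were installed is to look each one up on your PATH. This is a sketch: the function name check_tools is hypothetical, and the executable names are assumed to match the package names used in the mamba create command.

```shell
# Hypothetical helper: report whether each named command is on the PATH.
# Returns 0 if all were found and 1 if any is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Assumption: each package provides a command with the same name.
check_tools fastqc multiqc flexbar trimmomatic \
    || echo "Some tools are missing; is the qc environment active?"
```

Running the same check after conda deactivate should report the tools as missing, which illustrates that they belong to the qc environment rather than to base.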
Note that if you run conda deactivate too many times, you will completely deactivate conda. To re-activate conda, either log out and log back in, or manually activate it with source $HOME/mambaforge/bin/activate.
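If you are unsure whether conda is still active, you can inspect the CONDA_DEFAULT_ENV environment variable, which conda sets to the name of the active environment. A minimal sketch (the function name active_env is hypothetical):

```shell
# Hypothetical helper: report the active conda environment, if any,
# based on the CONDA_DEFAULT_ENV variable set by `conda activate`.
active_env() {
    if [ -n "${CONDA_DEFAULT_ENV:-}" ]; then
        echo "active conda environment: ${CONDA_DEFAULT_ENV}"
    else
        echo "no conda environment is active"
    fi
}

active_env
```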
Our discussion of software installation has focused on package management using conda and mamba. Another emerging technology for software management is software containers. Container technology is available on Linux and (with some extra effort) on macOS and on Windows with WSL2. Instead of installing software directly on your computer, containers provide a custom set of software gathered together into a single “container image”.
Containers are beyond the scope of this course but if you want to know more about them, you could read up on Docker for containerised bioinformatics and Singularity containers.