(this practical is adapted from Hadrien Gourle’s original)
In this practical you will learn to import, view and check the quality of raw high thoughput sequencing sequencing data.
The first dataset you will be working with is from an Illumina MiSeq dataset. The sequenced organism is an enterohaemorrhagic E. coli (EHEC) of the serotype O157, a potentially fatal gastrointestinal pathogen. The sequenced bacterium was part of an outbreak investigation in the St. Louis area, USA in 2011. The sequencing was done as paired-end 2x150bp.
Important: We will be using the SANBI training cluster that can be accessed at [https://cluster.sanbi.ac.za]. When you login there you are on the login or “head node” of the cluster. To do the work for this tutorial, you need to be on a “worker node”. Run this command:
srun --pty bash
(Note that srun
is a command provided by the Slurm job scheduling system. To learn more about working on a cluster with Slurm, read
this guide)
This will start a bash
shell on of the cluster worker nodes. Your prompt will change, e.g.
(base) user1@head:~$ srun --pty bash
(base) user1@worker-1:~$
Each time you get logged out from the cluster you need to run srun --pty bash
again to reconnect with the cluster.
Next you need to activate the conda environment that contains the software we are using for this tutorial. To understand conda and conda environments, read about software installation. Run:
conda activate qc
To activate the qc
conda environment that has been preconfigured with the software that we will be using. Once again the prompt will
change:
(base) user1@worker-1:~$ conda activate qc
(qc) user1@worker-1:~$
If you see a prompt that looks like (qc) user1@worker-1:~$
(the hostname might be worker-2
or worker-3
) you are ready for the rest
of the tutorial.
The raw data were deposited at the European Nucleotide Archive, under the accession number SRR957824. You could go to the ENA website and search for the run with the accession SRR957824.
However these files contain about 3 million reads and are therefore quite big. We are only going to use a subset of the original dataset for this tutorial.
First create a data/
directory in your home folder
mkdir ~/data
now let’s download the subset
cd ~/data
curl -O -J -L https://www.dropbox.com/s/tj8dx30yjrohycf/SRR957824_50K_R1.fastq.gz
curl -O -J -L https://www.dropbox.com/s/2vukpj842zwi829/SRR957824_50K_R2.fastq.gz
Let’s make sure we downloaded all of our data using md5sum.
md5sum SRR957824_50K_R1.fastq.gz SRR957824_50K_R2.fastq.gz
you should see this
a47630c4f803ce4d0dbdc3374463b2f7 SRR957824_50K_R1.fastq.gz
4020c98370fe52ea08559cf64fdab70f SRR957824_50K_R2.fastq.gz
and now look at the file names and their size
ls -l
total 97M
-rw-r--r-- 1 hadrien 4.8M Nov 19 18:44 SRR957824_50K_R1.fastq.gz
-rw-r--r-- 1 hadrien 5.0M Nov 19 18:53 SRR957824_50K_R2.fastq.gz
There are 50 000 paired-end reads taken randomly from the original data
One last thing before we get to the quality control: those files are writeable. By default, Linux makes things writeable by the file owner. This poses an issue with creating typos or errors in raw data. We fix that before going further
chmod u-w *
First we make a work directory: a directory where we can play around with a copy of the data without messing with the original
mkdir ~/work
cd ~/work
Now we make a link of the data in our working directory. A link created an alias for a file, in other words, a shortcut that refers to the files that we have downloaded. You can read more about links here.
ln -s ~/data/* .
The files that we’ve downloaded are FASTQ files. Take a look at one of them with
zless SRR957824_50K_R1.fastq.gz
!!! tip Use the space bar to scroll down, and type ‘q’ to exit ‘less’
You can read more on the FASTQ format here.
!!! question Where does the filename come from?
!!! question Why are there 1 and 2 in the file names?
To check the quality of the sequence data we will use a tool called FastQC.
FastQC has a graphical interface and can be downloaded and run on a Windows or Linux computer without installation. It is available here.
However, FastQC is also available as a command line utility on the training server you are using. To run FastQC on our two files
fastqc SRR957824_50K_R1.fastq.gz SRR957824_50K_R2.fastq.gz
and look what FastQC has produced
ls *fastqc*
For each file, FastQC has produced both a .zip archive containing all the plots, and a html report.
Download and open the html files with your favourite web browser.
Alternatively you can look a these copies of them:
!!! question What should you pay attention to in the FastQC report?
!!! question Which file is of better quality?
Pay special attention to the per base sequence quality and sequence length distribution. Explanations for the various quality modules can be found here. Also, have a look at examples of a good and a bad illumina read set for comparison.
You will note that the reads in your uploaded dataset have fairly poor quality (<20) towards the end. There are also outlier reads that have very poor quality for most of the second half of the reads.
Now we’ll do some trimming!
There are several tools (including flexar, trimmomatic, trimgalore, fastp and more) that remove Illumina sequencing adapters from sequences and perform quality trimming. Before doing this analysis, you might enjoy reading this explanation of adapter trimming from Illumina.
We will use two tools, flexbar
and trimmomatic
, with flexbar
doing the adapter trimming and trimmomatic
doing the quality trimming.
The first thing we need is the adapters to trim off
curl -O -J -L https://osf.io/v24pt/download
Now we run flexbar
on both our read files
flexbar -z GZ -a adapters.fasta --reads SRR957824_50K_R1.fastq.gz --reads2 SRR957824_50K_R2.fastq.gz
!!! question What adapters do you use? You can find out more about the adapters used with different Illumina kits here
!!! question Which adapter was found most often in the reads? Did you spot it in the FastQC results?
Most modern sequencing technologies produce reads that have deteriorating quality towards the 3’-end and some towards the 5’-end as well. Incorrectly called bases in both regions negatively impact assembles, mapping, and downstream bioinformatics analyses.
We will trim each read individually down to the good quality part to keep the bad part from interfering with downstream applications.
To do so, we will use `trimmomatic. Trimmomatic is a flexible read-trimming tool that can apply several different filters. You can read more about Trimmomatic’s filters here. The filters that we will use are the sliding window filter that looks for where the average (mean) quality of bases drops below a certain value in a window that gets moved across the reads from start to end, and the minimum length filter which will discard all reads shorter than a minimum length.
The trimmomatic
command line is long, so you might want to cut and paste it.
trimmomatic PE flexbarOut_1.fastq.gz flexbarOut_2.fastq.gz trimmed_paired_r1.fastq.gz trimmed_unpaired_r1.fastq.gz trimmed_paired_r2.fastq.gz trimmed_unpaired_r2.fastq.gz SLIDINGWINDOW:4:20 MINLEN:20
The output of trimmomatic
should be:
TrimmomaticPE: Started with arguments:
flexbarOut_1.fastq.gz flexbarOut_2.fastq.gz trimmed_paired_r1.fastq.gz trimmed_unpaired_r1.fastq.gz trimmed_paired_r2.fastq.gz trimmed_unpaired_r2.fastq.gz SLIDINGWINDOW:4:20 MINLEN:20
Multiple cores found: Using 4 threads
Quality encoding detected as phred33
Input Read Pairs: 43132 Both Surviving: 41616 (96.49%) Forward Only Surviving: 1025 (2.38%) Reverse Only Surviving: 265 (0.61%) Dropped: 226 (0.52%)
TrimmomaticPE: Completed successfully
Examine the output files with
ls -lh *.fastq.gz
!!!question Why are the unpaired outputs smaller than the paired outputs? And if the reads were originally all paired, why are there now unpaired reads?
Run fastqc again on the filtered reads
fastqc trimmed_paired_r*
and look at the reports
MultiQC is a tool that aggregates results from several popular QC bioinformatics software into one html report.
Let’s run MultiQC in our current directory
multiqc .
You can download the report or view it by clicking on the link below
!!! question What did the trimming do to the per-base sequence quality, the per sequence quality scores and the sequence length distribution?
We often want to combine different analyses into a bioinformatics workflow. Writing a shell script is a simple way to produce such a workflow.
Using nano create a file called qc.sh
with these contents:
#!/bin/bash
# the line above is a "shebang line" that tells the shell how to run this script
# tell bash to exit immediately if there are any errors
set -e
# check that we get two parameters, READ1 and READ2
if [ $# != 2 ] ; then
echo "Usage: qc.sh <READ1> <READ2>\n" >&2
exit 1
fi
# check that we have the adapter sequences
if [ ! -f adapters.fasta ] ; then
echo "Adapter sequences missing" >&2
exit 1
fi
READ1=$1
READ2=$2
for filename in $READ1 $READ2 ; do
if [ ! -f $filename ] ; then
echo "Read filename $filename is missing" >&2
exit 1
fi
done
# use the first part of READ1 name for the base name of the reads
# this assumes that the reads are called something ending with _R1.fastq.gz and _R2.fastq.gz
BASE_READ_NAME=$(echo $READ1 |sed 's/_R1.fastq.gz//')
WORKING_DIR="${BASE_READ_NAME}_tmp"
# delete the working directory if it already exists
if [ -d $WORKING_DIR ] ; then
rm -rf $WORKING_DIR
fi
mkdir $WORKING_DIR
cd $WORKING_DIR
echo Running FastQC on untrimmed reads
mkdir before
fastqc -o before ../$READ1 ../$READ2
echo Trimming adapters with flexbar
# this assumes that we are running the script from a directory with the ad
flexbar -z GZ -a ../adapters.fasta --reads ../$READ1 --reads2 ../$READ2
echo Quality trimming with trimmomatic
trimmomatic PE flexbarOut_1.fastq.gz flexbarOut_2.fastq.gz trimmed_paired_r1.fastq.gz trimmed_unpaired_r1.fastq.gz trimmed_paired_r2.fastq.gz trimmed_unpaired_r2.fastq.gz SLIDINGWINDOW:4:20 MINLEN:20
mv trimmed_paired_r1.fastq.gz ${BASE_READ_NAME}_trimmed_R1.fasta.gz
mv trimmed_paired_r2.fastq.gz ${BASE_READ_NAME}_trimmed_R2.fasta.gz
echo Running FastQC on trimmed reads
mkdir after
fastqc -o after ${BASE_READ_NAME}_trimmed_R1.fasta.gz ${BASE_READ_NAME}_trimmed_R2.fasta.gz
for directory in ../before ../after ; do
if [ ! -d $directory ] ; then
mkdir $directory
fi
done
cp before/* after/* .
echo Running multiqc
multiqc --filename ${BASE_READ_NAME}_multiqc.html .
cp before/* ../before
cp after/* ../after
cp ${BASE_READ_NAME}_multiqc.html ..
cp ${BASE_READ_NAME}_trimmed_R1.fasta.gz ${BASE_READ_NAME}_trimmed_R2.fasta.gz ..
echo Deleting working directory
rm -rf $WORKING_DIR
echo DONE
You can download this script with:
curl -J -O -L 'https://www.dropbox.com/s/hvgirjqememlo92/qc.sh'
Then make this script executable with:
chmod a+x qc.sh
and run it with:
./qc.sh SRR957824_50K_R1.fastq.gz SRR957824_50K_R2.fastq.gz