(adapted from Hadrien Gourle’s original)
In this practical you will learn to map NGS reads to a reference sequence, check the output using a viewer software and investigate some aspects of the results.
EHEC O157 strains generally carry a large virulence plasmid, pO157. Plasmids are circular genetic elements that many bacteria carry in addition to their chromosomes. This particular plasmid encodes a number of proteins which are known or suspected to be involved in the ability to cause severe disease in infected humans. Your task in this practical is to map your prepared read set to a reference sequence of the virulence plasmid, to determine if the pO157 plasmid is present in the St. Louis outbreak strain.
Illustration of plasmid integration into a host bacteria
—
As explained in the previous submodule, if you are on our training cluster, ensure that you are using a worker node (i.e. worker-1
,
worker-2
, worker-3
or worker-4
). Also ensure that you are are in the qc
conda environment (i.e. you have run conda activate qc
)
If you are on your own laptop, and you have not done this previously, ensure that you have all the software you need by running:
mamba create -n mapping -y bowtie2 samtools bcftools
and then use conda activate mapping
instead of conda activate qc
In the Quality Control tutorial you used a subset of 50,000 reads from sample SRR957824. In this tutorial we will use a
collection of about 500,000 reads that have been adapter and quality trimmed. Download the read dataset by going to your ~/work
directory (created
in the previous tutorial) - if you do not have it, create it with mkdir ~/work
) and downloading running these commands:
cd ~/work
wget -c https://www.dropbox.com/s/pgh5im31hf27pgt/SRR957824_500K_trimmed_R1.fastq.gz
wget -c https://www.dropbox.com/s/5szt9rneifwmdom/SRR957824_500K_trimmed_R2.fastq.gz
You will need a reference sequence to map your reads to.
cd ~/work
wget -O pO157_Sakai.fasta.gz -c https://osf.io/rnzbe/download
This file contains the sequence of the pO157 plasmid from the Sakai outbreak strain of E. coli O157. In contrast to the strain we are working on, this strain is available as a finished genome, i.e. the whole sequence of both the single chromosome and the large virulence plasmid are known.
Before aligning the reads against a reference, it is necessary to build an index of that reference
bowtie2-build pO157_Sakai.fasta.gz pO157_Sakai
One you have downloaded the two read files and the reference files their md5 checksums should be:
md5sum SRR957824_500K_trimmed_* pO157_Sakai.fasta.gz
3d3bfefc858c5b2f10985eed97334b84 SRR957824_500K_trimmed_R1.fastq.gz
0d2f830d3ecc8cb3b79142580cbfb510 SRR957824_500K_trimmed_R2.fastq.gz
0a5ab22c7e9b6f77c3bfb262c55e865b pO157_Sakai.fasta.gz
!!! note Indexing the reference is a necessary pre-processing step that makes searching for patterns much much faster. Many popular aligners such as Bowtie and BWA use an algorithm called the Burrows–Wheeler transform to build the index.
Now we are ready to map our reads
bowtie2 -x pO157_Sakai -1 SRR957824_500K_trimmed_R1.fastq.gz -2 SRR957824_500K_trimmed_R2.fastq.gz -S SRR957824.sam
The output of the mapping will be in the SAM format. You can find a brief explanation of the SAM format here
!!! note In this tutorial as well as many other places, you’ll often see the terms mapping and alignment being used interchangeably. If you want to read more about the difference between the two, I invite you to read this excellent Biostars discussion
head SRR957824.sam
But it is not very informative. In a few minutes, we’ll use samtools to visualise our data.
Most analysis tools work with BAM, a compressed version of the SAM output that bowtie2 generates. So we need to convert the SAM to BAM using this command:
samtools view -hSbo SRR957824.bam SRR957824.sam
Mapping data from bowtie2 is not sorted but most analysis tools require it to be sorted and indexed. Sort the BAM file per position in the genome and index it
samtools sort SRR957824.bam -o SRR957824.sorted.bam
samtools index SRR957824.sorted.bam
Finally we can visualise with samtools tview
. This is a text based user interface for viewing BAM files.
samtools tview SRR957824.sorted.bam
samtools tview
showing the aligned reads. Reads in uppercase are forward reads, reads in lowercase are reverse reads!!! tip
navigate in tview:
- left and right arrows scroll
- q
to quit
- CTRL-h
and CTRL-l
scrolls more
- g gi|10955266|ref|NC_002128.1|:8000
will take you to a specific location.
A frequent application for mapping reads is variant calling, i.e. finding positions where the reads are systematically different from the reference genome. Single nucleotide polymorphism (SNP)-based typing is particularly popular and used for a broad range of applications. For an EHEC O157 outbreak you could use it to identify the source, for instance.
We can call the variants using samtools mpileup
bcftools mpileup -f pO157_Sakai.fasta.gz SRR957824.sorted.bam | bcftools call -m -v -o SRR957824.vcf
Note that bcftools mpileup
will output the message maximum number of reads per input file set to -d 250
.
You can read about the structure of vcf files here.
The documentation is quite painful to read and take a look at the file
Look at the non-commented lines
grep -v ^## SRR957824.vcf
The first five columns are CHROM POS ID REF ALT.
Use
grep -v ^## SRR957824.vcf | less -S
for a better view.
!!! tip
Use your left and right arrows to scroll horizontally, and q
to quit.
!!! question How many SNPs did the variant caller find? Did you find any indels?
Examine one of the variants with tview
samtools tview SRR957824.sorted.bam pO157_Sakai.fasta.gz -p 'gi|10955266|ref|NC_002128.1|:43071'
Note that this time round, because we have included the reference in our samtools tview
, positions that match the reference are shown as ‘.’ in a forward
read and ‘,’ in a reverse read. When positions don’t match, the base is shown like at position 43071 in our example.
That variant seems very real!
!!! question Where do reference genomes come from?