Intro to Pathogen Bioinformatics

Workflows for SARS-CoV-2 amplicon sequencing

As previously described, SARS-CoV-2 is often sequenced using amplicon sequencing. This method amplifies the viral genome using two pools of PCR primers, and variation in the genome is then identified by mapping the reads against a reference genome. The reference genome is typically the Wuhan-Hu-1 sequence (MN908947.3, also known as NC_045512.2). Galaxy workflows are available for both Illumina and Oxford Nanopore data.

Importing a workflow into Galaxy

To import a workflow into your Galaxy account, select the Workflow menu and click the Import button in the top right. Then select the GA4GH servers tab and search for the SANBI workflows by entering name:SANBI in the search box.

Select either the Illumina or ONT workflows and click on them to display the workflow details. Import the latest version of the workflow that you are going to use.

Getting Reference Data for your workflow

You will need the SARS-CoV-2 reference genome (MN908947.3) and the ARTIC v4 primers. Get both by copying this history:

SARS-CoV-2 reference data

Running the SARS-CoV-2 Illumina workflow

Import the reads from the Illumina Reads history. Create a list of pairs collection from these reads by selecting all of the read files and using the Build List of Dataset Pairs option. Then:

Select Unpair all to un-pair the reads
Enter _1.fastq.gz as the filter on the left and _2.fastq.gz as the filter on the right.
Select Auto-pair to pair the reads while trimming the read names.
Choose a name for your samples (e.g. Samples)
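The pairing that Galaxy performs in the steps above can be sketched in Python. This is only an illustration of the idea (matching files by the _1.fastq.gz and _2.fastq.gz suffixes and trimming them off to get the sample name), not Galaxy's own implementation. The function name pair_reads and the file names in the usage example are hypothetical.

```python
def pair_reads(filenames, forward_suffix="_1.fastq.gz", reverse_suffix="_2.fastq.gz"):
    """Group file names into (forward, reverse) pairs keyed by sample name.

    The sample name is the file name with the suffix trimmed off.
    Files that match neither suffix, or that lack a mate, are ignored.
    """
    forward, reverse = {}, {}
    for name in filenames:
        if name.endswith(forward_suffix):
            forward[name[: -len(forward_suffix)]] = name
        elif name.endswith(reverse_suffix):
            reverse[name[: -len(reverse_suffix)]] = name
    # Keep only samples for which both mates are present
    return {
        sample: (forward[sample], reverse[sample])
        for sample in sorted(forward)
        if sample in reverse
    }


if __name__ == "__main__":
    # Hypothetical file names, for illustration only
    files = ["sampleA_1.fastq.gz", "sampleA_2.fastq.gz", "sampleB_1.fastq.gz"]
    for sample, (r1, r2) in pair_reads(files).items():
        print(sample, r1, r2)
```

A sample with only one of its two read files (sampleB here) is left out, which is a useful check to make before building the collection: any file that stays unpaired usually points to a missing upload or an unexpected file name.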

Once you have the reads organised into a list of pairs and you also have the reference genome and primer BED file in your history, you can run the SARS-CoV-2 Illumina Amplicon pipeline. Choose your sequence reads (e.g. the collection called Samples) as the Paired read collection for samples input, and then choose the matching datasets for the Reference FASTA and Primer BED inputs. Leave the other parameters at their defaults and click Run Workflow.

Running the SARS-CoV-2 Nanopore workflow

Import the reads from the Oxford Nanopore history. Create a list collection by selecting all the read files and using the Build Dataset List option. Give the list a name (e.g. Samples).

Interpreting the results

Coming Soon

Analysing your own data

To analyse your own data you need to:

Upload your data into a Galaxy history and organise it into a collection. See the notes in the Galaxy Tips on how to organise data into collections. Each sample should be in its own file.
Ensure that you have uploaded a copy of the SARS-CoV-2 MN908947 Reference Genome to your history. The link to upload is https://www.ebi.ac.uk/ena/browser/api/fasta/MN908947.3?download=true.
Ensure that you have the correct primer BED file that matches your sequencing primers. You can choose from the table below. Copy the link and paste it into the Galaxy upload form. Before you upload the dataset, choose bed as the format, because Galaxy often cannot automatically detect that a file is in BED format and will assign it the tabular format instead. If you find that your uploaded data is not in BED format, edit the dataset (with the Pencil icon), go to the Convert tab and change the datatype in the bottom section of the form (in the New Type section, not the Target datatype section).
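Before uploading, it can help to check locally that a primer file really looks like BED. The sketch below is a minimal check, assuming only the basic BED layout (tab-separated columns, with the sequence name, start and end in the first three). It is not the check that Galaxy itself performs, and the function name bed_problems and the example line are illustrative.

```python
def bed_problems(text):
    """Return a list of problems found in BED-formatted text (empty if none)."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        # Skip blank lines and optional header or comment lines
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        if len(fields) < 3:
            problems.append(
                f"line {number}: expected at least 3 tab-separated columns, "
                f"found {len(fields)}"
            )
            continue
        start, end = fields[1], fields[2]
        if not (start.isdigit() and end.isdigit()):
            problems.append(f"line {number}: start and end must be non-negative integers")
        elif int(start) > int(end):
            problems.append(f"line {number}: start is larger than end")
    return problems


if __name__ == "__main__":
    # Illustrative primer line, not copied from any published scheme
    example = "MN908947.3\t30\t54\tamplicon_1_LEFT\t1\t+\n"
    print(bed_problems(example) or "no problems found")
```

A common cause of failure is a file whose columns are separated by spaces instead of tabs; this check reports such a line as having a single column.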

Amplicon Scheme	Amplicon Size (bp)	Notes
ARTIC v3	400
ARTIC v4	400
ARTIC v4.1	400
Midnight v1	1200	ONT kit MRT001.10
Midnight v2	1200	ONT kit MRT001.20
ONT Midnight v3	1200	ONT kit MRT001.30

If you are using Midnight primers for Oxford Nanopore data, set the minimum size to 150 and the maximum size to 1200. This is because the Nanopore Midnight protocol uses the rapid library preparation chemistry, and thus tagmentation, so reads are often shorter than the full 1200 bp amplicon (see this post).
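To make the effect of these two settings concrete, here is a small sketch of a size filter. It assumes that the minimum and maximum sizes are bounds on read length and that both bounds are inclusive. Both are assumptions for illustration, not a description of how the workflow implements the filter. The function name filter_by_length and the reads in the usage example are hypothetical.

```python
def filter_by_length(reads, min_length=150, max_length=1200):
    """Keep (name, sequence) reads whose length lies within the bounds.

    Both bounds are treated as inclusive, which is an assumption made
    for this sketch.
    """
    return [
        (name, sequence)
        for name, sequence in reads
        if min_length <= len(sequence) <= max_length
    ]


if __name__ == "__main__":
    # Hypothetical reads: only their lengths matter here
    reads = [
        ("too_short", "A" * 100),
        ("fragment", "A" * 600),
        ("full_amplicon", "A" * 1200),
        ("too_long", "A" * 1500),
    ]
    for name, sequence in filter_by_length(reads):
        print(name, len(sequence))
```

With a minimum of 150, the 600-base fragment produced by tagmentation is kept alongside the full-length amplicon, whereas a minimum set close to the amplicon size would discard it.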

Previous submodule:

Galaxy Tips for Pathogen Bioinformatics

Next submodule:

SARS-CoV-2 Practice Exercises