Assembly using Spades

Spades is one of several de novo assemblers that take short read sets (e.g. Illumina reads) as input. Its assembly method is based on de Bruijn graphs.
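
To make the idea of a de Bruijn graph concrete, here is a minimal Python sketch. It is an illustration only and not how Spades is implemented: each read is cut into k-mers, and each k-mer becomes an edge from its first k-1 bases to its last k-1 bases. The function name and the toy reads are made up for this example.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the list of (k-1)-mers that follow it in the reads."""
    graph = defaultdict(list)
    for read in reads:
        # Slide a window of width k along the read.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            prefix, suffix = kmer[:-1], kmer[1:]
            # Record each edge only once, even if several reads share it.
            if suffix not in graph[prefix]:
                graph[prefix].append(suffix)
    return dict(graph)


# Toy, hypothetical reads that overlap each other.
reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
graph = build_de_bruijn(reads, k=4)
for node in sorted(graph):
    print(node, "->", ", ".join(graph[node]))
```

Following the edges from ATG onwards spells out a single longer sequence, which is the intuition behind assembling overlapping reads into contigs.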

For information about Spades see this link.

New to Galaxy? First try the introduction and then learn some key tasks.

The data

The read set for today is from an imaginary Staphylococcus aureus bacterium with a miniature genome.

Our mutant strain read set was produced by whole genome shotgun sequencing on an Illumina DNA sequencing instrument.
The files we need for assembly are the mutant_R1.fastq and mutant_R2.fastq.
The reads are paired-end.
Each read is 150 bases long.
The number of bases sequenced is equivalent to 19x the genome size of the wildtype strain (a read coverage of 19x, which is rather low).
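
As a rough sketch of where a coverage figure like this comes from: coverage is the total number of bases sequenced divided by the genome size. The read count and genome size below are hypothetical values chosen for illustration; they are not the real numbers for this data set.

```python
def estimate_coverage(n_reads, read_length, genome_size):
    """Return the average read coverage: total sequenced bases / genome size."""
    return n_reads * read_length / genome_size


# Hypothetical example: 25,000 reads of 150 bases over a 200,000-base genome.
coverage = estimate_coverage(n_reads=25_000, read_length=150, genome_size=200_000)
print(f"{coverage:.2f}x")
```

For paired-end data, count the reads in both files (R1 and R2) when filling in the number of reads.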

Import the data

Log in to your Galaxy instance (for example, Galaxy Australia, usegalaxy.org.au).

Use shared data

If you are using Galaxy Australia, you can import the data from a shared data library.

In the top menu bar, go to Shared Data.

Click on Data Libraries.
Click on Galaxy Australia Training Material: Assembly: Microbial Assembly.
Tick the boxes next to the two files.
Click the To History button, select As Datasets.
Name a new history and click Import.
In the top menu bar, click Analyze Data.
You should now have two files in your current history.

Or, import from the web

Only follow this step if you are unable to load the data files from shared data, as described above.

In a new browser tab, go to this webpage:

Find the file called mutant_R1.fastq
Right click on file name: select “copy link address”
In Galaxy, go to Get Data and then Upload File
Click Paste/Fetch data
A box will appear: paste in link address
Click Start
Click Close
The file will now appear in the top of your history panel.
Repeat for mutant_R2.fastq.

Shorten file names

Click on the pencil icon next to the file name.
In the centre Galaxy panel, click in the box under Name
Shorten the file name to mutant_R1.fastq
Then click Save

We now have two FASTQ read files in our history.

Click on the eye icon next to one of the FASTQ sequence files.
View the file in the centre Galaxy panel.
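
When viewing the file, note that each read occupies four lines: a header starting with @, the sequence, a separator line starting with +, and a quality string with one character per base. The sketch below reads that structure in Python. The record is made up, and the quality decoding assumes the Phred+33 encoding used by current Illumina instruments.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality) tuples from a four-line-per-record FASTQ file."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            break  # end of file
        sequence = handle.readline().rstrip()
        handle.readline()  # skip the '+' separator line
        quality = handle.readline().rstrip()
        yield header[1:], sequence, quality


# A made-up record standing in for the contents of mutant_R1.fastq.
example = io.StringIO("@read_1/1\nACGTACGT\n+\nIIIIHHGG\n")
for name, sequence, quality in read_fastq(example):
    # Assumption: Phred+33, so the score is the character code minus 33.
    scores = [ord(char) - 33 for char in quality]
    print(name, sequence, scores)
```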

Quality Control

If you want to check the quality of your reads, see the Quality Control tutorial.

Note: Skip over the “Import Data” section and instead use the file called mutant_R1.fastq that is already in your current history.

Assemble the reads

We will perform a de novo assembly of the mutant FASTQ reads into long contiguous sequences (in FASTA format).

Go to the Tool panel and search for “spades” in the search box.
Click on SPAdes
Set the following parameters (leave other settings as they are):
- Run only Assembly: Yes [the Yes button should be darker grey]
- Kmers to use separated by commas: 33,55,91 [note: no spaces]
- Coverage cutoff: auto
- Files → Forward reads: mutant_R1.fastq
- Files → Reverse reads: mutant_R2.fastq
Your tool interface should look like this:

[Figure: Spades interface]

Click Execute
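
For readers who later run Spades outside Galaxy, the parameters set above correspond roughly to the command-line options in this sketch. It assumes a local SPAdes installation that provides spades.py; the output directory name is hypothetical, and the Galaxy tool may set further options that are not shown here. The script only builds and prints the command; it does not run it.

```python
import shlex


def spades_command(forward, reverse, kmers, outdir):
    """Build (but do not run) a SPAdes command for one paired-end library."""
    # SPAdes expects odd k-mer sizes below 128, listed in ascending order.
    if any(k % 2 == 0 or k >= 128 for k in kmers) or list(kmers) != sorted(kmers):
        raise ValueError("k-mer sizes must be odd, below 128 and ascending")
    return [
        "spades.py",
        "--only-assembler",                      # Run only Assembly: Yes
        "-k", ",".join(str(k) for k in kmers),   # Kmers, comma-separated, no spaces
        "--cov-cutoff", "auto",                  # Coverage cutoff: auto
        "-1", forward,                           # Forward reads
        "-2", reverse,                           # Reverse reads
        "-o", outdir,                            # hypothetical output directory
    ]


command = spades_command("mutant_R1.fastq", "mutant_R2.fastq", [33, 55, 91], "spades_output")
print(shlex.join(command))
```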

How do I choose settings when running a tool?

In this case, most of the default settings are appropriate for our data set and analysis.
Under the tool interface in Galaxy there will usually be a more detailed description of the tool options, and a link to the tool’s documentation.
It is recommended that you read about the tool parameters in more detail in the documentation, and adjust to your data and analysis accordingly.

Examine the output

Galaxy is now running Spades on the reads for you.
When it is finished, you will have five (or more) new files in your history, including:
- two FASTA files of the resulting contigs and scaffolds
- two files for statistics about these
- the Spades logfile

[Figure: Spades output]

To view the output, click on the eye icon next to each of the files.
Note that the short reads have been assembled into much longer contigs.
(However, in this case, the contigs have not been assembled into larger scaffolds.)
The stats files will give you the length of each of the contigs, and the file should look something like this:

[Figure: Spades output contigs]
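
The contig lengths reported in the stats file can also be computed directly from a FASTA file. The following is a minimal sketch that assumes a plain FASTA layout (a header line starting with > followed by one or more sequence lines); the sequences are made up and far shorter than real contigs.

```python
import io


def contig_lengths(handle):
    """Return {contig_name: length} for a FASTA file handle."""
    lengths = {}
    name = None
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            # A new contig begins; start counting its bases from zero.
            name = line[1:]
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


# Made-up sequences standing in for a contigs FASTA file.
fasta = io.StringIO(">NODE_1\nACGTACGTAC\nGGTT\n>NODE_2\nTTGACA\n")
lengths = contig_lengths(fasta)
for name, length in lengths.items():
    print(name, length)
print("total bases:", sum(lengths.values()))
```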

Extension exercise

Look at the contigs.stats file.
Find a contig that seems to have high coverage relative to the other contigs.
Extract this sequence from the contigs.fasta file. Select the sequence for the contig of interest (called a Node in the file) and copy it.
Go to the NCBI page and BLAST this sequence to see what it matches.
Try the “blastx” option, which will translate your nucleotide sequence into a protein sequence.
For Enter Query Sequence, paste your sequence into the box.
For Genetic Code choose “Bacteria and Archaea”.
For Database, try the “SwissProt” database. You can also re-try with other options to see how the database affects the results.
All other options can be left as default. Click BLAST.
What does your sequence match?
Does this suggest that the sequence is a repeat region in this bacterial genome?
For a detailed description of the output, see the top right corner of the page and click “Blast report description”.
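
The first steps of this exercise, finding a contig with unusually high coverage, can also be sketched in code. The example assumes contig names in the style SPAdes normally writes, NODE_<number>_length_<bases>_cov_<coverage>; the names in your Galaxy output may differ. The headers and the "twice the median" threshold are hypothetical choices for illustration.

```python
import re
from statistics import median

# Assumed SPAdes-style contig name, e.g. NODE_4_length_1200_cov_95.7
HEADER = re.compile(r"NODE_(\d+)_length_(\d+)_cov_([\d.]+)")


def high_coverage_contigs(headers, fold=2.0):
    """Return (name, coverage) for contigs whose coverage exceeds fold x the median."""
    parsed = []
    for header in headers:
        match = HEADER.search(header)
        if match:
            parsed.append((header, float(match.group(3))))
    if not parsed:
        return []
    typical = median(cov for _, cov in parsed)
    return [(name, cov) for name, cov in parsed if cov > fold * typical]


# Made-up contig names for illustration.
headers = [
    "NODE_1_length_52000_cov_18.5",
    "NODE_2_length_31000_cov_20.1",
    "NODE_3_length_9000_cov_19.2",
    "NODE_4_length_1200_cov_95.7",
]
for name, cov in high_coverage_contigs(headers):
    print(name, cov)
```

A short contig with coverage several times higher than the rest is a reasonable candidate to BLAST, since reads from every copy of a repeated sequence tend to pile up on one contig.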

See this history in Galaxy

If you want to see this Galaxy history without performing the steps above:

Log in to Galaxy Australia: https://usegalaxy.org.au/
Go to Shared Data
Click Histories
Click Completed-assembly-analysis
Click Import (at the top right corner)
The analysis should now be showing as your current history.

More information

Here are some references covering more information about genome assembly.

More about de Bruijn graphs: Compeau PEC, Pevzner PA, Tesler G. How to apply de Bruijn graphs to genome assembly. Nat Biotechnol. 2011 Nov 8;29(11):987–91.

An assembler for long reads: Koren S, Walenz BP, Berlin K, Miller JR, Phillippy AM. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Res. 2017 May;27(5):722–36.

An assembler for large genomes: Jackman SD, Vandervalk BP, Mohamadi H, Chu J, Yeo S, Hammond SA, Jahesh G, Khan H, Coombe L, Warren RL, Birol I. ABySS 2.0: resource-efficient assembly of large genomes using a Bloom filter. Genome Res. 2017 May;27(5):768–77.

Visualizing genome assemblies: Wick RR, Schultz MB, Zobel J, Holt KE. Bandage: interactive visualization of de novo genome assemblies. Bioinformatics. 2015 Oct 15;31(20):3350–2.

Yeast genome assembly: Goodwin S, Gurtowski J, Ethe-Sayers S, Deshpande P, Schatz MC, McCombie WR. Oxford Nanopore sequencing, hybrid error correction, and de novo assembly of a eukaryotic genome. Genome Res. 2015 Nov;25(11):1750–6.

Animal genome assembly: Austin CM, Tan MH, Harrisson KA, Lee YP, Croft LJ, Sunnucks P, Pavlova A, Gan HM. De novo genome assembly and annotation of Australia’s largest freshwater fish, the Murray cod (Maccullochella peelii), from Illumina and Nanopore sequencing read. Gigascience. 2017 Aug 1;6(8):1–6.

Human genome assembly: Chaisson MJP, Wilson RK, Eichler EE. Genetic variation and the de novo assembly of human genomes. Nat Rev Genet. 2015 Nov;16(11):627–40.

Plant genome assembly: Jiao W-B, Schneeberger K. The impact of third generation genomic technologies on plant genome assembly. Curr Opin Plant Biol. 2017 Apr;36:64–70.

What’s next?

You can find more tutorials at the Galaxy Training Network:

http://galaxyproject.github.io/training-material/