Genome annotation using Prokka

Background

In this section we will use a software tool called Prokka to annotate a draft genome sequence.

Prokka is a “wrapper”; it collects together several pieces of software (from various authors), and so avoids “re-inventing the wheel”.
Prokka finds and annotates features (both protein-coding regions and RNA genes, i.e. tRNA and rRNA) present on a sequence.
Note that Prokka uses a two-step process to annotate protein-coding regions: first, protein-coding regions on the genome are identified using Prodigal; second, the function of each encoded protein is predicted by similarity to proteins in one of many protein or protein domain databases.
Prokka can be used to annotate bacterial, archaeal and viral genomes quickly, generating standard output files in GenBank, EMBL and GFF formats.
More information about Prokka can be found here.

New to Galaxy? First try the introduction and then learn some key tasks

Input data

Log in to your Galaxy instance (for example, Galaxy Australia, usegalaxy.org.au).

Prokka requires assembled contigs.

If you are continuing on from the previous workshop (Assembly with SPAdes), this file will be in your current history, named something like SPAdes contigs(fasta).
Or, to upload a file of contigs:

Use shared data

If you are using Galaxy Australia, you can import the data from a shared data library.

In the top menu bar, go to Shared Data.

Click on Data Libraries.
Click on Galaxy Australia Training Material: Annotation: Microbial Annotation.
Tick the box next to the file.
Click the To History button, select As Datasets.
Name a new history and click Import.
In the top menu bar, click Analyze Data.
You should now have one file in your current history.

Or, import from the web

Only follow this step if you are unable to load the data files from shared data, as described above.

In a new browser tab, go to this webpage:

Find the file called contigs.fasta
Right click on file name: select “copy link address”
In Galaxy, go to Get Data and then Upload File
Click Paste/Fetch data
A box will appear: paste in link address
Click Start
Click Close
The file will now appear at the top of your history panel. Once the upload has finished, the file name will turn green.

Shorten file name

Click on the pencil icon next to the file name.
In the centre Galaxy panel, click in the box under Name
Shorten the file name.
Then click Save

Run Prokka

Go to the Tool panel and search for “prokka” in the search box.
Click on Prokka
For Contigs to annotate, select your contigs.fasta file.
All the other settings can be left as they are.

Your tool interface should look like this:

[Screenshot: Prokka tool interface]

Click Execute

How do I choose settings when running a tool?

In this case, the default settings are appropriate for our data set and analysis.
Under the tool interface in Galaxy there will usually be a more detailed description of the tool options, and a link to the tool’s documentation.
It is recommended that you read about the tool parameters in more detail in the documentation, and adjust to your data and analysis accordingly.
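
For readers who later run Prokka outside Galaxy, the settings in the tool form correspond roughly to command-line options. The following is a minimal sketch in Python that only builds the command and does not run it. The output directory, prefix and CPU count are illustrative assumptions, and the kingdom and genetic code values are written out explicitly rather than taken from the Galaxy form.

```python
import shlex


def build_prokka_command(contigs, outdir="prokka_out", prefix="sample",
                         kingdom="Bacteria", gcode=11, cpus=4):
    """Return a Prokka command as a list of arguments (nothing is executed).

    outdir, prefix and cpus are hypothetical example values; adjust them
    to your own data and machine.
    """
    return [
        "prokka",
        "--outdir", outdir,      # directory for the output files
        "--prefix", prefix,      # file name prefix for the output files
        "--kingdom", kingdom,    # annotation mode
        "--gcode", str(gcode),   # genetic code / translation table
        "--cpus", str(cpus),     # number of CPUs to use
        contigs,                 # assembled contigs in FASTA format
    ]


command = build_prokka_command("contigs.fasta")
print(shlex.join(command))
```

On a machine where Prokka is installed, the list could be passed to `subprocess.run(command, check=True)`. Building the command as a list, rather than one string, avoids quoting problems with file names that contain spaces.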

Examine the output

Once Prokka has finished, examine each of its output files.

The GFF and GBK files contain all of the information about the features annotated (in different formats).
The .txt file contains a summary of the number of features annotated.
The .faa file contains the protein sequences of the genes annotated.
The .ffn file contains the nucleotide sequences of the genes annotated.
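
To get a feel for what the GFF file holds, you can tally its features by type. The sketch below assumes the GFF has been downloaded from Galaxy and read into a string. The sample records here are invented for illustration and are not real Prokka output. The parser stops at the `##FASTA` line, because a GFF3 file may carry the sequence after the annotation lines.

```python
from collections import Counter


def count_feature_types(gff_text):
    """Count features by type (column 3) in GFF3 text."""
    counts = Counter()
    for line in gff_text.splitlines():
        if line.startswith("##FASTA"):
            break  # the rest of the file is sequence, not annotation
        if not line or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        fields = line.split("\t")
        if len(fields) == 9:  # a well-formed GFF3 record has nine columns
            counts[fields[2]] += 1
    return counts


# Invented example records (hypothetical IDs and coordinates).
SAMPLE_GFF = "\n".join([
    "##gff-version 3",
    "##sequence-region contig_1 1 5000",
    "contig_1\texample\tCDS\t100\t400\t.\t+\t0\tID=demo_00001;product=hypothetical protein",
    "contig_1\texample\tCDS\t600\t1500\t.\t-\t0\tID=demo_00002;product=example enzyme",
    "contig_1\texample\ttRNA\t1700\t1775\t.\t+\t.\tID=demo_00003;product=tRNA-Ala",
    "contig_1\texample\trRNA\t2000\t3500\t.\t+\t.\tID=demo_00004;product=16S ribosomal RNA",
    "##FASTA",
    ">contig_1",
    "ATGC",
])

counts = count_feature_types(SAMPLE_GFF)
for feature_type, number in sorted(counts.items()):
    print(feature_type, number)
```

For a real file, replace `SAMPLE_GFF` with the contents of the downloaded file, for example `open("prokka.gff").read()`, where the file name is whatever you saved it as.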

View annotated features in JBrowse

Now that we have annotated the draft genome sequence, we would like to view the sequence in the JBrowse genome viewer.

Go to the Galaxy tool panel, and use the top search box to search for “JBrowse”.
Click JBrowse

Leave most settings as they are, except for:

Under Reference genome to display choose Use a genome from history.
Under Select the reference genome choose Prokka on data XX:fna. This .fna sequence is the fasta nucleotide sequence, and will be the reference against which annotations are displayed.

For Genetic Code choose 11: The Bacterial, Archaeal and Plant Plastid Code.

Click Insert Track Group
Under Track Category type in gene annotations.
Click Insert Annotation Track
For Track Type choose GFF/GFF3/BED/GBK Features
For GFF/GFF3/BED Track Data select Prokka on data XX:gff [Note: not wildtype.gff]

Your tool interface should look like this:

[Screenshot: JBrowse tool interface]

Click Execute

JBrowse output

A new file will be created, called JBrowse on data XX and data XX - Complete.

In place of “XX”, there will be numbers that will refer to the files that Galaxy used in your particular history.

Click on the eye icon next to the file name. The JBrowse window will appear in the centre Galaxy panel.

Under Available Tracks on the left, tick the box for Prokka on data XX:gff.
In the drop-down box, select a contig to view. Only one contig is displayed at a time. Choose the longest contig.

[Screenshot: JBrowse view of the selected contig]
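
If it is not obvious from the drop-down box which contig is the longest, the lengths can be worked out from the FASTA file. This is a minimal sketch that assumes the contigs have been downloaded as plain FASTA text. The contig names and sequences below are invented.

```python
def contig_lengths(fasta_text):
    """Return a dict mapping each contig name to its sequence length."""
    lengths = {}
    name = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The contig name is the first word after ">".
            header = line[1:].split()
            name = header[0] if header else "unnamed"
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)  # sequence may span several lines
    return lengths


# Invented example contigs.
SAMPLE_FASTA = """>contig_1 example
ATGCATGCAT
ATGC
>contig_2 example
ATGCATGCATGCATGCATGC
>contig_3 example
ATG
"""

lengths = contig_lengths(SAMPLE_FASTA)
longest = max(lengths, key=lengths.get)
print(longest, lengths[longest])  # prints: contig_2 20
```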

Use the plus and minus buttons to zoom in and out, and the arrows to move left or right (or click and drag within the window to move left or right).
Zoom in to see the reference sequence at the top. JBrowse displays the sequence and a 6-frame amino acid translation.

Zoomed in view:

[Screenshot: zoomed-in JBrowse view showing the reference sequence and translation]
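
The 6-frame translation shown in JBrowse can be reproduced by hand: three reading frames on the forward strand and three on the reverse complement. The sketch below uses the standard codon assignments. Genetic code 11 assigns codons to amino acids in the same way as the standard code and differs in which codons may act as start codons, so the translation itself is unaffected. The frame labels (+1 to -3) are a naming choice made here and may not match the way JBrowse labels its frames.

```python
from itertools import product

# Codon table built from the compact NCBI layout (bases in the order T, C, A, G).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons become 'X'."""
    codons = (seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translation(seq):
    """Return translations for frames +1, +2, +3, -1, -2 and -3."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


frames = six_frame_translation("ATGGCCTAA")
for label in ("+1", "+2", "+3", "-1", "-2", "-3"):
    print(label, frames[label])
```

In the output, `*` marks a stop codon, so frame +1 of this short example reads `MA*`.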

Right click on a gene/feature annotation (the bars on the annotation track), then select View Details to see more information, such as:
- gene name
- product name
- the FASTA sequence, which you can download by clicking on the disk icon.

Extension exercise

Open the JBrowse file that you created to show the gene annotations.
In the drop-down box, select the longest contig.
Find an annotation for a “hypothetical protein”.
- Alternatively, in the coordinates box, enter “32,500”.
- Zoom out; see the annotation that spans this location. It is a “hypothetical protein”.
Click on this annotation. See the nucleotide sequence: select and copy.
Go to the NCBI BLAST page and search with this sequence to see what it matches.
Try the “blastx” option, which translates your nucleotide sequence in all six reading frames and compares the translations against a protein database.
For Enter Query Sequence, paste your sequence into the box.
For Genetic Code choose “Bacteria and Archaea”.
For Database, try the “SwissProt” database. You can also re-try with other options to see how the database affects the results.
All other options can be left as default. Click BLAST.
What does your sequence match?
In our example, the top hit is to a carboxylase enzyme.
In the Show Conserved Domains window, we can see that this is part of a superfamily of metallo-dependent hydrolases. Click through to find out more about this superfamily.
Genome annotation is constantly refined. We can see here that some manual investigation gave us more information about an annotation. This would need to be experimentally confirmed.
For a detailed description of the BLAST output, see the top right corner of the page and click “Blast report description”.
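
The first part of this exercise, finding the annotation that spans a given coordinate, can also be done directly on the GFF file. The sketch below assumes the GFF has been read into a string and that the product name is stored in a `product` attribute in the ninth column. The records, contig name and coordinates are invented for illustration and will not match your own output.

```python
def parse_attributes(column):
    """Split a GFF3 attribute column (key=value;key=value) into a dict."""
    attributes = {}
    for item in column.split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attributes[key] = value
    return attributes


def features_spanning(gff_text, seqid, position):
    """Return (type, start, end, strand, product) for features covering position."""
    hits = []
    for line in gff_text.splitlines():
        if line.startswith("##FASTA"):
            break  # sequence section, no more annotation records
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 9 or fields[0] != seqid:
            continue
        start, end = int(fields[3]), int(fields[4])  # 1-based, inclusive
        if start <= position <= end:
            product_name = parse_attributes(fields[8]).get("product", "")
            hits.append((fields[2], start, end, fields[6], product_name))
    return hits


# Invented example records (hypothetical IDs and coordinates).
SAMPLE_GFF = "\n".join([
    "##gff-version 3",
    "contig_1\texample\tCDS\t31000\t32000\t.\t+\t0\tID=demo_00030;product=example kinase",
    "contig_1\texample\tCDS\t32100\t33400\t.\t-\t0\tID=demo_00031;product=hypothetical protein",
    "contig_1\texample\tCDS\t34000\t35000\t.\t+\t0\tID=demo_00032;product=example transporter",
])

hits = features_spanning(SAMPLE_GFF, "contig_1", 32500)
for hit in hits:
    print(hit)
```

Filtering the result for a product of “hypothetical protein” gives a list of candidates for the kind of manual follow-up described above.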

See this history in Galaxy

If you want to see this Galaxy history without performing the steps above:

Log in to Galaxy Australia: https://usegalaxy.org.au/
Go to Shared Data
Click Histories
Click Completed-annotation-analysis
Click Import (at the top right corner)
The analysis should now be showing as your current history.

More information

Here are some references covering more information about genome annotation.

Prokaryote genome annotation: Seemann T. Prokka: rapid prokaryotic genome annotation. Bioinformatics. 2014 Jul 15;30(14):2068–9.

Eukaryote genome annotation: Yandell M, Ence D. A beginner’s guide to eukaryotic genome annotation. Nat Rev Genet. 2012 Apr 18;13(5):329–42.

Wheat genome annotation: Clavijo BJ, et al. An improved assembly and annotation of the allohexaploid wheat genome identifies complete families of agronomic genes and provides genomic evidence for chromosomal translocations. Genome Res. 2017 May;27(5):885–96.

Human genome annotation: Harrow J, et al. GENCODE: the reference human genome annotation for The ENCODE Project. Genome Res. 2012 Sep;22(9):1760–74.

Annotation in the clinical setting: Steward CA, Parker APJ, Minassian BA, Sisodiya SM, Frankish A, Harrow J. Genome annotation for clinical genomic diagnostics: strengths and weaknesses. Genome Med. 2017 May 30;9(1):49.

Assessing your annotation: Simão FA, Waterhouse RM, Ioannidis P, Kriventseva EV, Zdobnov EM. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics. 2015 Oct 1;31(19):3210–2.

Comparative genome annotation: König S, Romoth L, Stanke M. Comparative Genome Annotation. Methods Mol Biol. 2018;1704:189–212.

What’s next?

You can find more tutorials at the Galaxy Training Network:

http://galaxyproject.github.io/training-material/