Genome annotation using Prokka
In this section we will use a software tool called Prokka to annotate a draft genome sequence.
- Prokka is a “wrapper”; it collects together several pieces of software (from various authors), and so avoids “re-inventing the wheel”.
- Prokka finds and annotates features (both protein coding regions and RNA genes, e.g. tRNA, rRNA) present on a sequence.
- Note, Prokka uses a two-step process for the annotation of protein coding regions: first, protein coding regions on the genome are identified using Prodigal; second, the function of the encoded protein is predicted by similarity to proteins in one of many protein or protein domain databases.
- Prokka is a software tool that can be used to annotate bacterial, archaeal and viral genomes quickly, generating standard output files in GenBank, EMBL and gff formats.
- More information about Prokka can be found here.
- Log in to your Galaxy instance (for example, Galaxy Australia, usegalaxy.org.au).
Prokka requires assembled contigs.
If you are continuing on from the previous workshop (Assembly with Spades), this file will be in your current history named something like
Or, to upload a file of contigs:
Use shared data
If you are using Galaxy Australia, you can import the data from a shared data library.
In the top menu bar, go to
- Click on
- Click on Galaxy Australia Training Material: Annotation: Microbial Annotation.
- Tick the box next to the file.
- Click the To History button, select As Datasets.
- Name a new history and click
- In the top menu bar, click
- You should now have one file in your current history.
Or, import from the web
Only follow this step if you are unable to load the data files from shared data, as described above.
- In a new browser tab, go to this webpage:
- Find the file called
- Right-click on the file name and select “copy link address”.
- In Galaxy, go to Get Data and then Upload File.
- A box will appear: paste in the link address.
- The file will now appear at the top of your history panel. When uploaded, the file name will turn green.
Shorten file name
- Click on the pencil icon next to the file name.
- In the centre Galaxy panel, click in the box under
- Shorten the file name.
- Then click
- Go to the Tool panel and search for “prokka” in the search box.
Contigs to annotate: your contigs.fasta file
- All the other settings can be left as they are.
Your tool interface should look like this:
How do I choose settings when running a tool?
- In this case, the default settings are appropriate for our data set and analysis.
- Under the tool interface in Galaxy there will usually be a more detailed description of the tool options, and a link to the tool’s documentation.
- It is recommended that you read about the tool parameters in more detail in the documentation, and adjust to your data and analysis accordingly.
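For readers who later want to try the same settings outside Galaxy, the sketch below builds a basic Prokka command line in Python. It is not what Galaxy runs internally. The file name `contigs.fasta`, the output directory `prokka_out` and the prefix `sample` are placeholder assumptions, and only the `--outdir`, `--prefix` and `--kingdom` options described in the Prokka documentation are used. The command is only executed if Prokka is installed and the input file exists.

```python
import os
import shlex
import shutil
import subprocess


def build_prokka_command(contigs, outdir="prokka_out", prefix="sample", kingdom="Bacteria"):
    """Return the argument list for a basic Prokka run (placeholder names)."""
    return [
        "prokka",
        "--outdir", outdir,    # directory for the output files
        "--prefix", prefix,    # file name prefix for the output files
        "--kingdom", kingdom,  # annotation mode
        contigs,               # assembled contigs in FASTA format
    ]


command = build_prokka_command("contigs.fasta")
print(shlex.join(command))

# Only attempt the run if Prokka is installed and the input file is present.
if shutil.which("prokka") and os.path.exists("contigs.fasta"):
    subprocess.run(command, check=True)
```

Keeping the settings in one function makes it easy to see which options were changed from their defaults when you come back to an analysis later.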
Examine the output
Once Prokka has finished, examine each of its output files.
The GFF and GBK files contain all of the information about the features annotated (in different formats).
The .txt file contains a summary of the number of features annotated.
The .faa file contains the protein sequences of the genes annotated.
The .ffn file contains the nucleotide sequences of the genes annotated.
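As a cross-check on the summary in the .txt file, the feature types in the GFF file can be counted directly. The following is a minimal sketch, assuming a standard nine-column, tab-separated GFF3 file with an optional `##FASTA` section at the end. The example records are invented for illustration and are not real Prokka output.

```python
from collections import Counter


def count_feature_types(gff_lines):
    """Count the values in column 3 (feature type) of GFF3 lines."""
    counts = Counter()
    for line in gff_lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # everything after this marker is sequence, not features
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment/pragma lines
        fields = line.split("\t")
        if len(fields) == 9:
            counts[fields[2]] += 1
    return counts


# Invented example records (tab-separated), for illustration only.
example_gff = """##gff-version 3
##sequence-region contig_1 1 5000
contig_1\texample\tCDS\t100\t1200\t.\t+\t0\tID=EX_00001;product=hypothetical protein
contig_1\texample\tCDS\t1300\t2100\t.\t-\t0\tID=EX_00002;product=example enzyme
contig_1\texample\ttRNA\t2500\t2575\t.\t+\t.\tID=EX_00003;product=tRNA-Ala
##FASTA
>contig_1
ACGTACGT
""".splitlines()

counts = count_feature_types(example_gff)
for feature_type, number in sorted(counts.items()):
    print(feature_type, number)
```

To use it on a downloaded file, pass an open file handle instead of the example lines, for instance `count_feature_types(open("annotation.gff"))`, where `annotation.gff` is a placeholder file name.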
View annotated features in JBrowse
Now that we have annotated the draft genome sequence, we would like to view the sequence in the JBrowse genome viewer.
- Go to the Galaxy tool panel, and use the top search box to search for “JBrowse”.
Leave most settings as they are, except for:
Reference genome to display: choose Use a genome from history.
Select the reference genome: choose Prokka on data XX:fna. This .fna sequence is the FASTA nucleotide sequence, and will be the reference against which annotations are displayed.
Genetic Code: choose 11: The Bacterial, Archaeal and Plant Plastid Code.
Insert Track Group
Track Category: type in gene annotations.
Insert Annotation Track
Track Type: choose GFF/GFF3/BED/GBK Features
GFF/GFF3/BED Track Data: select Prokka on data XX:gff [Note: not wildtype.gff]
Your tool interface should look like this:
A new file will be created, called
- In place of “XX”, there will be numbers that will refer to the files that Galaxy used in your particular history.
Click on the eye icon next to the file name. The JBrowse window will appear in the centre Galaxy panel.
Under Available Tracks on the left, tick the box for Prokka on data XX:gff.
Select a contig to view, in the drop down box. You can only see one contig displayed at a time. Choose the longest contig.
Use the plus and minus buttons to zoom in and out, and the arrows to move left or right (or click and drag within the window to move left or right).
Zoom in to see the reference sequence at the top. JBrowse displays the sequence and a 6-frame amino acid translation.
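To make the idea of a 6-frame translation concrete, here is a minimal Python sketch: three reading frames on the forward strand and three on the reverse complement. It is an illustration rather than JBrowse's own implementation, and the frame labels (`+1` to `-3`) and the short test sequence are assumptions made for this example. Genetic code 11 assigns the same amino acids to codons as the standard code (it differs in which codons may act as start codons), so the standard codon table is used here.

```python
# Codon table in the conventional TCAG order; "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons; unknown codons (e.g. with N) become X."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translation(seq):
    """Translate all three frames on both strands."""
    frames = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames


frames = six_frame_translation("ATGGCCTAA")
for label, protein in frames.items():
    print(label, protein)
```

For the test sequence, frame `+1` reads ATG GCC TAA, which translates to `MA*`: a start codon, an alanine and a stop codon.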
Zoomed in view:
- Right click on a gene/feature annotation (the bars on the annotation track), then select View Details to see more information.
- gene name
- product name
- you can download the FASTA sequence by clicking on the disk icon.
- Open the JBrowse file that you created to show the gene annotations.
- In the drop-down box, select the longest contig.
- Find an annotation for a “hypothetical protein”.
- Alternatively, in the coordinates box, enter “32,500”.
- Zoom out; see the annotation that spans this location. It is a “hypothetical protein”.
- Click on this annotation. See the nucleotide sequence: select and copy.
- Go to the NCBI page and BLAST this sequence to see what it matches.
- Try the “blastx” option, which will translate your nucleotide sequence into a protein sequence.
Under Enter Query Sequence, paste your sequence into the box.
Genetic Code: choose “Bacteria and Archaea”.
Under Database, try the “SwissProt” database. You can also re-try with other options to see how the database affects the results.
- All other options can be left as default. Click
- What does your sequence match?
- In our example, the top hit is to a carboxylase enzyme.
- In the Show Conserved Domains window, we can see that this is part of a superfamily of metallo-dependent hydrolases. Click through to find out more about this superfamily.
- Genome annotation is constantly refined. We can see here that some manual investigation gave us more information about an annotation. This would need to be experimentally confirmed.
- For a detailed description of the BLAST output, see the top right corner of the page and click “Blast report description”.
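The manual check above can be repeated for other genes. As a rough sketch, the following Python code lists the identifiers of all proteins described as “hypothetical protein” in a protein FASTA file such as the .faa output. It assumes each header has the form `>identifier description`. The example records and identifiers are invented for illustration.

```python
def hypothetical_protein_ids(fasta_lines):
    """Return the IDs of records whose description says 'hypothetical protein'."""
    ids = []
    for line in fasta_lines:
        if line.startswith(">"):
            header = line[1:].strip()
            # Split the header into the ID and the free-text description.
            seq_id, _, description = header.partition(" ")
            if "hypothetical protein" in description.lower():
                ids.append(seq_id)
    return ids


# Invented example records, for illustration only.
example_faa = """>EX_00001 hypothetical protein
MKTLLVAGG
>EX_00002 example enzyme
MSTNPKPQR
>EX_00003 hypothetical protein
MAAILKKE
""".splitlines()

candidates = hypothetical_protein_ids(example_faa)
print(len(candidates), "hypothetical proteins:", ", ".join(candidates))
```

Each identifier in the list is a candidate for the same kind of BLAST search, keeping in mind that any function suggested by similarity would still need experimental confirmation.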
See this history in Galaxy
If you want to see this Galaxy history without performing the steps above:
- Log in to Galaxy Australia: https://usegalaxy.org.au/
- Go to
Import (at the top right corner)
- The analysis should now be showing as your current history.
Here are some references covering more information about genome annotation.
Prokaryote genome annotation: Seemann T. Prokka: rapid prokaryotic genome annotation. Bioinformatics. 2014 Jul 15;30(14):2068–9.
Eukaryote genome annotation: Yandell M, Ence D. A beginner’s guide to eukaryotic genome annotation. Nat Rev Genet. 2012 Apr 18;13(5):329–42.
Wheat genome annotation: Clavijo BJ, et al. An improved assembly and annotation of the allohexaploid wheat genome identifies complete families of agronomic genes and provides genomic evidence for chromosomal translocations. Genome Res. 2017 May;27(5):885–96.
Human genome annotation: Harrow J, et al. GENCODE: the reference human genome annotation for The ENCODE Project. Genome Res. 2012 Sep;22(9):1760–74.
Annotation in the clinical setting: Steward CA, Parker APJ, Minassian BA, Sisodiya SM, Frankish A, Harrow J. Genome annotation for clinical genomic diagnostics: strengths and weaknesses. Genome Med. 2017 May 30;9(1):49.
Assessing your annotation: Simão FA, Waterhouse RM, Ioannidis P, Kriventseva EV, Zdobnov EM. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics. 2015 Oct 1;31(19):3210–2.
Comparative genome annotation: König S, Romoth L, Stanke M. Comparative Genome Annotation. Methods Mol Biol. 2018;1704:189–212.