RNA-Seq: Bacteria

This tutorial is about using RNA-seq data to investigate differential gene expression in bacteria, using Galaxy tools and Degust (a tool on the web).

New to Galaxy? First try the introduction and then learn some key tasks

Background

Differential Gene Expression (DGE) is the process of determining whether any genes were expressed at a different level between two conditions. For example, the conditions could be wildtype versus mutant, or two growth conditions. Usually multiple biological replicates are done for each condition - these are needed to separate variation within the condition from that between the conditions.

Input data: reads and reference

RNA-Seq reads

A typical experiment will have 2 conditions each with 3 replicates, for a total of 6 samples.

Our RNA-seq reads are from 6 samples in FASTQ format.
- We have single-end reads; so one file per sample.
- Data could also be paired-end reads, and there would be two files per sample.
These have been reduced to 1% of their original size for this tutorial.
The experiment used the bacteria E. coli grown in two conditions.
- Files labelled “LB” are the wildtype
- Files labelled “MG” have been exposed to 0.5% αMG - alpha methyglucoside (a sugar solution).

Reference genome

The reference genomes is in FASTA format and the gene annotations are in GTF format.

The FASTA file contains the DNA sequence(s) that make up the genome; e.g. the chromosome and any plasmids.
The GTF file lists the coordinates (position) of each feature. Commonly-annotated features are genes, transcripts and protein-coding sequences.

Upload files to Galaxy

Log in to your Galaxy instance (for example, Galaxy Australia, usegalaxy.org.au).

Use shared data

If you are using Galaxy Australia, you can import the data from a shared data library.

In the top menu bar, go to Shared Data.

Click on Data Libraries.
Click on Galaxy Australia Training Material: RNA-Seq: Microbial RNA-Seq.
Tick the boxes next to the nine files.
Click the To History button, select As Datasets.
Name a new history and click Import.
In the top menu bar, click Analyze Data.
[Optional] Next to each file, click on the pencil icon and change (shorten) its name.
You should now have eight files in your current history ready for the analysis, plus an additional “JBrowse” file that we will use later.

Files for DGE

Or, import from the web

*Only follow this step if unable to load the data files from shared data, as described above. Click here to expand.*

* In a new browser tab, go to this webpage: * [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.1311269.svg)](https://doi.org/10.5281/zenodo.1311269) * For each file, right click on file name: select "copy link address" * In Galaxy, go to Get Data and then Upload File * Click Paste/Fetch data * A box will appear: paste in link address * Click Start * Click Close * The file will now appear in the top of your history panel. * Repeat for all files in Zenodo. * Change (shorten) the file names with the pencil icon.

Convert the GTF file

In the tool panel, search for “GTF”.

Click on GTF-to-GFF converter
select the Ecoli_k12.gtf file
Click Execute

This will produce a convereted reference genome file that we need in a different format for downstream analyses.

Re-name the output file with the pencil icon, e.g. to Ecoli_k12.gff.

Align reads to reference

The RNA-Seq reads are fragmented and are not complete transcripts. To determine the transcripts from which the reads originated (and therefore, to which gene they correspond) we can map them to a reference genome.

In Galaxy:

In the Tools panel, search for Map with BWA-MEM and click on it.
Under Will you select a reference genome from your history or use a built-in index?: Use a genome from history and build index
Use the following dataset as the reference sequence: Ecoli_k12.fasta
Single or Paired-end reads: single
Select fastq dataset:
- Click on the Multiple Datasets icon in centre
- Select all 6 FASTQ files (they turn blue; use side-scroll bar to check all have been selected)
- This will map each set of reads to the reference genome

Your tool interface should look like this:

BWA-MEM tool commands

Click Execute
Click Refresh in the history pane to see if the analysis has finished.
Output: 6 bam files of reads mapped to the reference genome.
Re-name the output files:
- These are called Map with BWA-MEM on data x and data x.
- The x will refer to the numbered file that Galaxy used in the analysis.
- Click on the pencil icon next to each of these and re-name them as their sample name, e.g. LB1.bam, LB2.bam etc.

View the mapped reads

In the tool panel, search for the tool “JBrowse” and click on JBrowse Genome Browser.

Under Reference genome to display choose Use a genome from history.
Under Select the reference genome choose Ecoli_k12.fasta.
For Produce a Standalone Instance select Yes.
For Genetic Code choose 11: The Bacterial, Archaeal and Plant Plastid Code.
We will now set up several different “tracks” - these are datasets displayed underneath the reference sequence (which is displayed as nucleotides in FASTA format). We will choose to display the mapped reads from one of each of the experimental conditions, and the annotated reference genome.

Track 1 - mapped RNA-seq reads

Click Insert Track Group
For Track Cateogry name it “RNA-seq reads”
Click Insert Annotation Track
For Track Type choose BAM Pileups
For BAM Track Data select LB1.bam and MG1.bam (Your files may be named differently)
For Autogenerate SNP Track select Yes

Track 2 - annotated reference

Click Insert Track Group again
For Track Category name it “annotated reference”
Click Insert Annotation Track
For Track Type choose GFF/GFF3/BED/GBK Features
For Track Data select Ecoli_k12.gff Note - select the GFF not the GTF file
Click Execute

A new file will be created, called JBrowse on data XX and data XX - Complete.

This file may take some time to be generated. For this workshop, we downloaded a completed JBrowse file at the start, which you could look at instead (it should be the same). This is the JBrowse on data XX and data XX - Complete file; the numbers are not important.

Click on the eye icon next to the file name. The JBrowse window will appear in the centre Galaxy panel.
On the left, tick boxes to display the tracks
Use the plus and minus buttons to zoom in and out.

This visualization is for us to check that the mapping of the reads worked as expected.

JBrowse screenshot

You may be able to see some places where more reads have mapped to a gene from one of the conditions. This suggests that the gene in that condition was more highly expressed.

However, this needs to be verified with statistical testing, which will be covered in the next parts of this tutorial.

Count reads per gene

We now need to count how many reads overlap with particular genes. The information about gene names is from the annotations in the GTF file.

In Galaxy:

In the Tools panel, search for count matrix and click on SAM/BAM to count matrix
- Note: Don’t select the tool called htseq-count. The SAM/BAM to count matrix also uses that tool but allows an input of multiple bam files, which is what we want.
For Gene model (GFF) file to count reads over from your current history, select the GTF file.
For Reads are stranded select Yes (box turns dark grey)
Leave the next two settings as default
For GTF feature type for counting reads… select transcript.
For bam/sam file from your history choose the 6 bam files.

Your tool interface should look like this:

HTSeq tool commands

Click Execute
Click Refresh in the history pane to see if the analysis has finished.

Output

There is one output file: bams to DGE count matrix.
Click on the file name to expand the information in the History pane.
Click on the file icon underneath to download it to your computer for use later on in this tutorial.
Click on the eye icon to see this file.

counts file

Each row is a gene (or feature) and each column is a sample, with counts against each gene.
Have a look at how the counts vary between samples, per gene. (These are quite low as we are using a cut-down data set).
We can’t just compare the counts directly; they need to be normalized before comparison, and this will be done as part of the DGE analysis in the next step.

Test for differential expression

There are various tools available to test for differential gene expression. In today’s tutorial, we will use the tool Voom (link to the paper here).

In the Tools panel, search for Differential_Count models (don’t forget the underscore) and click on it.
- This has options to use edgeR, DESeq, or Voom. Here we will use Voom.
For Select an input matrix choose the count matrix file generated in the previous step.
For Title for job outputs enter DGE using voom.
For Select columns containing treatment tick boxes for the MG samples.
For Select columns containing control tick boxes for the LB samples.
Under Run this model using edgeR choose Do not run edgeR.
Under Run the same model with DESeq2 and compare findings choose Do not run DESeq2.
Under Run the same model with Voom/limma and compare findings choose Run VOOM.

Your tool interface should look like this:

DGE using voom

Click Execute.

Output

There are two output files. We will look at the file called DEGusingvoom_topTable_VOOM.xls.

The Contig column shows the genes that had transcripts mapped (genes with low counts may be filtered out).
The adj.P.Val is the statistical significance (p value) adjusted for multiple testing. The table is sorted by this column (most significant to least significant).
The logFC is the log2 fold change, which is the change in gene expression between the treatment and control group.

toptable

We can see that the most statistically signifcant result is that the ptsG gene was expressed differently between the two conditions.

Its expression in the treatment group was lower.
The log base 2 change in expression is approximately -4, meaning the difference in expression is approximately 16x.

To check, let’s look at the ptsG gene in the original table of counts, the bams to DGE count matrix.

Click on the eye icon to view.
Search for the ptsG gene (Cmd-F to search on a Mac).
Do the read counts look different between the control samples and the treatment samples?

DGE in Degust

Degust is a tool on the web that can analyse the counts files produced in the step above, to test for differential gene expression. (Degust can also display the results from DGE analyses performed elsewhere.)

Upload counts file

Go to the Degust web page.

Click Upload your counts file.
Click on Choose File.
Select the htseq output file. tabular (that you previously downloaded to your computer from Galaxy) and click Open. This file may have a different name, for example, it may have “Galaxy” in the title.
Click Upload.

A Configuation page will appear.

For Name type DGE in E coli
For Info columns select Contig
For Min gene read count put 10.
Click Add condition
- Add a condition called “Control” and select the LB columns.
- Add a condition called “Treament” and select the MG columns.
Save changes
View - this brings up the Degust viewing window.

Overview of Degust sections

Left: Conditions: Control and Treatment.
Left: Method selection for DGE.
Top centre: Plots, with options at right.
When an expression plot is selected, a heatmap appears below.
A table of genes (or features); expression in treatment relative to control (Treatment column); and significance (FDR column).
The FDR is the False Discovery Rate, also known as the adjusted p value.

Degust overview

Analyze gene expression

Under Method, make sure that Voom/Limma is selected.
Click Apply. This runs Voom/Limma on the uploaded counts.

MDS plot

First, look at the MDS plot.

MDSplot

This is a multidimensional scaling plot which represents the variation between samples.
Ideally:
- All the LB samples would be close to each other
- All the MG samples would be close to each other
- The LB and MG groups would be far apart
The x-axis is the dimension with the highest magnitude. The control/treatment samples should be split along this axis.
Our LB samples are on the left and the MG samples are on the right, which means they are well separated on their major MDS dimension, which looks correct.

Expression - MA plot

Each dot shows the change in expression in one gene.

The average expression (over both condition and treatment samples) is represented on the x-axis.
- Plot points should be symmetrical around the x-axis.
- We can see that many genes are expressed at a low level, and some are highly expressed.
The fold change is represented on the y axis.
- If expression is significantly different between treatment and control, the dots are red. If not, they are blue. (In Degust, significant means FDR <0.05).
- At low levels of gene expression (low values of the x axis), fold changes are less likely to be significant.

Click on the dot to see the gene name.

MAplot

Expression - Volcano plot

Another way to view expression levels is with the volcano plot.

As with the MA plot, each dot is a gene.
This time, the logFC axis is horizontal (in the MA plot, it was vertical).
The vertical axis is a measure of statistical significance (-log10 FDR).
If expression is significantly different between treatment and control, the dots are red. If not, they are blue. (In Degust, significant means FDR <0.05).
This is a quick way to visualize those genes with large changes in expression (the left and right sides of the graph).

volcano plot

Expression - Parallel Coordinates and heatmap

Each line shows the change in expression in one gene, between control and treatment.

Go to Options at the right.
- For FDR cut-off set at 0.001.
- This is a significance level (an adjusted p value). We will set it quite low in this example, to ensure we only examine key differences.
Look at the Parallel Coordinates plot. There are two axes:
- Left: Control: Gene expression in the control samples. All values are set at zero.
- Right: Treatment Gene expression in the treatment samples, relative to expression in the control.
The blocks of blue and red underneath the plot are called a heatmap.
- Each block is a gene. Click on a block to see its line in the plot above.
- Look at the row for the Treatment. Relative to the control, genes expressed more are red; genes expressed less are blue.

Parallel Coordinates plot

Note:

for an experiment with multiple treatments, the various treatment axes can be dragged to rearrange. There is no natural order (such as a time series).

Table of genes

Contig: names of genes. Note that gene names are sometimes specific to a species, or they may be only named as a locus ID (a chromosomal location specified in the genome annotation).
FDR: False Discovery Rate. This is an adjusted p value to show the significance of the difference in gene expression between two conditions. Click on column headings to sort. By default, this table is sorted by FDR.
Control and Treatment: log2(Fold Change) of gene expression. The default display is of fold change in the treatment relative to the control. Therefore, values in the “Control” column are zero. This can be changed in the Options panel at the top right.
In some cases, a large fold change will be meaningful but in others, even a small fold change can be important biologically.

Table of genes and expression:

gene table

Investigate differentially-expressed genes

To learn more about the differentially-expressed genes:

Go to the NCBI website.
Under All Databases, click on Gene
Enter the gene name in the search bar; e.g. ptsG
Click on the first result that matches the species (e.g. in this case, E. coli).
- This provides information about the gene, and may also show further references (e.g. in this case, a link to the EcoGene resource).

Some of the most (statistically) significant differentially-expressed genes in this experiment are:

ptsG: a glucose-specific transporter.
setA: a sugar efflux transporter; is induced by glucose-phosphate stress.
sucD: the alpha subunit of the the gene for succinyl-CoA synthetase; involved in ATP production.
sucB: a component of the 2-oxoglutarate dehydrogenase complex; catalyzes a step in the Krebs cycle.
deoC: 2-deoxyribose-5-phosphate aldolase; binds selenium; may be involved in selenium transport.

See this history in Galaxy

If you want to see this Galaxy history without performing the steps above:

Log in to Galaxy Australia: https://usegalaxy.org.au/
Go to Shared Data
Click Histories
Click Published-RNA-seq-bacteria
Click Import (at the top right corner)
The analysis should now be showing as your current history.

More information

Here are some references covering more information about RNA-seq.

** A clear, up-to-date introduction to RNA-seq and testing for differential expression (pre-print): ** Berge, K Hembach, K Soneson, C Tiberi, S Clement, L Love, M Patro, R Robinson, M. RNA sequencing data: hitchhiker’s guide to expression analysis. PeerJPreprints. Available from: http://dx.doi.org/10.7287/peerj.preprints.27283

** Transcriptome assembly: ** Martin JA, Wang Z. Next-generation transcriptome assembly. Nat Rev Genet. 2011 Sep 7;12(10):671–82.

** SuperTranscripts: ** Davidson NM, Hawkins ADK, Oshlack A. SuperTranscripts: a data driven reference for analysis and visualisation of transcriptomes. Genome Biol. 2017 Aug 4;18(1):148.

** Splicing: ** Alamancos GP, Agirre E, Eyras E. Methods to study splicing from high-throughput RNA sequencing data. Methods Mol Biol. 2014;1126:357–97.

** Single-cell RNA-seq: ** McCarthy DJ, Campbell KR, Lun ATL, Wills QF. Scater: pre-processing, quality control, normalization and visualization of single-cell RNA-seq data in R. Bioinformatics. 2017 Apr 15;33(8):1179–86.

** Pathway analysis: ** Khatri P, Sirota M, Butte AJ. Ten years of pathway analysis: current approaches and outstanding challenges. PLoS Comput Biol. 2012 Feb 23;8(2):e1002375.

What’s next?

To use the tutorials on this website:

← see the list in the left hand panel
↖ or, click the menu button (three horizontal bars) in the top left of the page

You can find more tutorials at the Galaxy Training Network:

http://galaxyproject.github.io/training-material/