RNA-Seq: Bacteria
This tutorial is about using RNA-seq data to investigate differential gene expression in bacteria, using Galaxy tools and Degust (a tool on the web).
Background
Differential Gene Expression (DGE) is the process of determining whether any genes were expressed at a different level between two conditions. For example, the conditions could be wildtype versus mutant, or two growth conditions. Usually multiple biological replicates are done for each condition - these are needed to separate variation within the condition from that between the conditions.
Input data: reads and reference
RNA-Seq reads
A typical experiment will have 2 conditions each with 3 replicates, for a total of 6 samples.
- Our RNA-seq reads are from 6 samples in
FASTQ format.- We have single-end reads; so one file per sample.
- Data could also be paired-end reads, and there would be two files per sample.
- These have been reduced to 1% of their original size for this tutorial.
- The experiment used the bacteria E. coli grown in two conditions.
- Files labelled “LB” are the wildtype
- Files labelled “MG” have been exposed to 0.5% αMG - alpha methyglucoside (a sugar solution).
Reference genome
The reference genomes is in
- The
FASTA file contains the DNA sequence(s) that make up the genome; e.g. the chromosome and any plasmids. - The
GTF file lists the coordinates (position) of each feature. Commonly-annotated features are genes, transcripts and protein-coding sequences.
Upload files to Galaxy
- Log in to your Galaxy instance (for example, Galaxy Australia, usegalaxy.org.au).
Use shared data
If you are using Galaxy Australia, you can import the data from a shared data library.
In the top menu bar, go to
- Click on
Data Libraries . - Click on
Galaxy Australia Training Material: RNA-Seq: Microbial RNA-Seq . - Tick the boxes next to the nine files.
- Click the
To History button, select As Datasets. - Name a new history and click
Import . - In the top menu bar, click
Analyze Data . - [Optional] Next to each file, click on the pencil icon and change (shorten) its name.
- You should now have eight files in your current history ready for the analysis, plus an additional “JBrowse” file that we will use later.
Or, import from the web
*Only follow this step if unable to load the data files from shared data, as described above. Click here to expand.*
* In a new browser tab, go to this webpage: * [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.1311269.svg)](https://doi.org/10.5281/zenodo.1311269) * For each file, right click on file name: select "copy link address" * In Galaxy, go toConvert the GTF file
In the tool panel, search for “GTF”.
- Click on
GTF-to-GFF converter - select the
Ecoli_k12.gtf file - Click
Execute
This will produce a convereted reference genome file that we need in a different format for downstream analyses.
Re-name the output file with the pencil icon, e.g. to
Align reads to reference
The RNA-Seq reads are fragmented and are not complete transcripts. To determine the transcripts from which the reads originated (and therefore, to which gene they correspond) we can map them to a reference genome.
In Galaxy:
- In the
Tools panel, search forMap with BWA-MEM and click on it. - Under
Will you select a reference genome from your history or use a built-in index? : Use a genome from history and build index Use the following dataset as the reference sequence :Ecoli_k12.fasta Single or Paired-end reads : singleSelect fastq dataset :- Click on the
Multiple Datasets icon in centre - Select all 6
FASTQ files (they turn blue; use side-scroll bar to check all have been selected) - This will map each set of reads to the reference genome
- Click on the
Your tool interface should look like this:
- Click
Execute - Click
Refresh in the history pane to see if the analysis has finished. -
Output: 6
bam files of reads mapped to the reference genome. -
Re-name the output files:
- These are called
Map with BWA-MEM on data x and data x . - The x will refer to the numbered file that Galaxy used in the analysis.
- Click on the pencil icon next to each of these and re-name them as their sample name, e.g.
LB1.bam ,LB2.bam etc.
- These are called
View the mapped reads
In the tool panel, search for the tool “JBrowse” and click on
-
Under
Reference genome to display choose Use a genome from history. -
Under
Select the reference genome chooseEcoli_k12.fasta . -
For
Produce a Standalone Instance select Yes. -
For
Genetic Code choose 11: The Bacterial, Archaeal and Plant Plastid Code. -
We will now set up several different “tracks” - these are datasets displayed underneath the reference sequence (which is displayed as nucleotides in FASTA format). We will choose to display the mapped reads from one of each of the experimental conditions, and the annotated reference genome.
Track 1 - mapped RNA-seq reads
- Click
Insert Track Group - For
Track Cateogry name it “RNA-seq reads” - Click
Insert Annotation Track - For
Track Type choose BAM Pileups - For
BAM Track Data selectLB1.bam andMG1.bam (Your files may be named differently) - For
Autogenerate SNP Track select Yes
Track 2 - annotated reference
- Click
Insert Track Group again - For
Track Category name it “annotated reference” - Click
Insert Annotation Track - For
Track Type choose GFF/GFF3/BED/GBK Features - For
Track Data selectEcoli_k12.gff Note - select the GFF not the GTF file
- Click
Execute
A new file will be created, called
This file may take some time to be generated. For this workshop, we downloaded a completed JBrowse file at the start, which you could look at instead (it should be the same). This is the
-
Click on the eye icon next to the file name. The JBrowse window will appear in the centre Galaxy panel.
-
On the left, tick boxes to display the tracks
-
Use the plus and minus buttons to zoom in and out.
This visualization is for us to check that the mapping of the reads worked as expected.
You may be able to see some places where more reads have mapped to a gene from one of the conditions. This suggests that the gene in that condition was more highly expressed.
However, this needs to be verified with statistical testing, which will be covered in the next parts of this tutorial.
Count reads per gene
We now need to count how many reads overlap with particular genes. The information about gene names is from the annotations in the GTF file.
In Galaxy:
- In the
Tools panel, search forcount matrix and click onSAM/BAM to count matrix - Note: Don’t select the tool called htseq-count. The SAM/BAM to count matrix also uses that tool but allows an input of multiple bam files, which is what we want.
- For
Gene model (GFF) file to count reads over from your current history , select theGTF file. - For
Reads are stranded select Yes (box turns dark grey) - Leave the next two settings as default
- For
GTF feature type for counting reads… select transcript. - For
bam/sam file from your history choose the 6bam files.
Your tool interface should look like this:
- Click
Execute - Click
Refresh in the history pane to see if the analysis has finished.
Output
- There is one output file:
bams to DGE count matrix . - Click on the file name to expand the information in the History pane.
- Click on the file icon underneath to download it to your computer for use later on in this tutorial.
- Click on the eye icon to see this file.
- Each row is a gene (or feature) and each column is a sample, with counts against each gene.
- Have a look at how the counts vary between samples, per gene. (These are quite low as we are using a cut-down data set).
- We can’t just compare the counts directly; they need to be normalized before comparison, and this will be done as part of the DGE analysis in the next step.
Test for differential expression
There are various tools available to test for differential gene expression. In today’s tutorial, we will use the tool Voom (link to the paper here).
- In the
Tools panel, search forDifferential_Count models (don’t forget the underscore) and click on it.- This has options to use edgeR, DESeq, or Voom. Here we will use Voom.
- For
Select an input matrix choose thecount matrix file generated in the previous step. - For
Title for job outputs enter DGE using voom. - For
Select columns containing treatment tick boxes for the MG samples. - For
Select columns containing control tick boxes for the LB samples. - Under
Run this model using edgeR choose Do not run edgeR. - Under
Run the same model with DESeq2 and compare findings choose Do not run DESeq2. - Under
Run the same model with Voom/limma and compare findings choose Run VOOM.
Your tool interface should look like this:
- Click
Execute .
Output
There are two output files. We will look at the file called
- The
Contig column shows the genes that had transcripts mapped (genes with low counts may be filtered out). - The
adj.P.Val is the statistical significance (p value) adjusted for multiple testing. The table is sorted by this column (most significant to least significant). - The
logFC is the log2 fold change, which is the change in gene expression between the treatment and control group.
We can see that the most statistically signifcant result is that the ptsG gene was expressed differently between the two conditions.
- Its expression in the treatment group was lower.
- The log base 2 change in expression is approximately -4, meaning the difference in expression is approximately 16x.
To check, let’s look at the ptsG gene in the original table of counts, the
- Click on the eye icon to view.
- Search for the ptsG gene (Cmd-F to search on a Mac).
- Do the read counts look different between the control samples and the treatment samples?
DGE in Degust
Degust is a tool on the web that can analyse the counts files produced in the step above, to test for differential gene expression. (Degust can also display the results from DGE analyses performed elsewhere.)
Upload counts file
Go to the Degust web page.
- Click
Upload your counts file . - Click on
Choose File . - Select the
htseq output file. tabular (that you previously downloaded to your computer from Galaxy) and clickOpen . This file may have a different name, for example, it may have “Galaxy” in the title. - Click
Upload .
A Configuation page will appear.
- For
Name type DGE in E coli - For
Info columns select Contig - For
Min gene read count put 10. - Click
Add condition - Add a condition called “Control” and select the LB columns.
- Add a condition called “Treament” and select the MG columns.
Save changes View - this brings up the Degust viewing window.
Overview of Degust sections
- Left: Conditions: Control and Treatment.
- Left: Method selection for DGE.
- Top centre: Plots, with options at right.
- When an expression plot is selected, a heatmap appears below.
- A table of genes (or features); expression in treatment relative to control (Treatment column); and significance (FDR column).
- The FDR is the False Discovery Rate, also known as the adjusted p value.
Analyze gene expression
- Under
Method , make sure thatVoom/Limma is selected. - Click
Apply . This runs Voom/Limma on the uploaded counts.
MDS plot
First, look at the MDS plot.
- This is a multidimensional scaling plot which represents the variation between samples.
- Ideally:
- All the LB samples would be close to each other
- All the MG samples would be close to each other
- The LB and MG groups would be far apart
- The x-axis is the dimension with the highest magnitude. The control/treatment samples should be split along this axis.
- Our LB samples are on the left and the MG samples are on the right, which means they are well separated on their major MDS dimension, which looks correct.
Expression - MA plot
Each dot shows the change in expression in one gene.
- The average expression (over both condition and treatment samples) is represented on the x-axis.
- Plot points should be symmetrical around the x-axis.
- We can see that many genes are expressed at a low level, and some are highly expressed.
- The fold change is represented on the y axis.
- If expression is significantly different between treatment and control, the dots are red. If not, they are blue. (In Degust, significant means FDR <0.05).
- At low levels of gene expression (low values of the x axis), fold changes are less likely to be significant.
Click on the dot to see the gene name.
Expression - Volcano plot
Another way to view expression levels is with the volcano plot.
- As with the MA plot, each dot is a gene.
- This time, the logFC axis is horizontal (in the MA plot, it was vertical).
- The vertical axis is a measure of statistical significance (-log10 FDR).
- If expression is significantly different between treatment and control, the dots are red. If not, they are blue. (In Degust, significant means FDR <0.05).
- This is a quick way to visualize those genes with large changes in expression (the left and right sides of the graph).
Expression - Parallel Coordinates and heatmap
Each line shows the change in expression in one gene, between control and treatment.
- Go to
Options at the right.- For
FDR cut-off set at 0.001. - This is a significance level (an adjusted p value). We will set it quite low in this example, to ensure we only examine key differences.
- For
-
Look at the Parallel Coordinates plot. There are two axes:
- Left: Control: Gene expression in the control samples. All values are set at zero.
- Right: Treatment Gene expression in the treatment samples, relative to expression in the control.
-
The blocks of blue and red underneath the plot are called a heatmap.
- Each block is a gene. Click on a block to see its line in the plot above.
- Look at the row for the Treatment. Relative to the control, genes expressed more are red; genes expressed less are blue.
Note:
- for an experiment with multiple treatments, the various treatment axes can be dragged to rearrange. There is no natural order (such as a time series).
Table of genes
- Contig: names of genes. Note that gene names are sometimes specific to a species, or they may be only named as a locus ID (a chromosomal location specified in the genome annotation).
- FDR: False Discovery Rate. This is an adjusted p value to show the significance of the difference in gene expression between two conditions. Click on column headings to sort. By default, this table is sorted by FDR.
- Control and Treatment: log2(Fold Change) of gene expression. The default display is of fold change in the treatment relative to the control. Therefore, values in the “Control” column are zero. This can be changed in the
Options panel at the top right. - In some cases, a large fold change will be meaningful but in others, even a small fold change can be important biologically.
Table of genes and expression:
Investigate differentially-expressed genes
To learn more about the differentially-expressed genes:
- Go to the NCBI website.
- Under
All Databases , click on Gene - Enter the gene name in the search bar; e.g. ptsG
- Click on the first result that matches the species (e.g. in this case, E. coli).
- This provides information about the gene, and may also show further references (e.g. in this case, a link to the EcoGene resource).
Some of the most (statistically) significant differentially-expressed genes in this experiment are:
- ptsG: a glucose-specific transporter.
- setA: a sugar efflux transporter; is induced by glucose-phosphate stress.
- sucD: the alpha subunit of the the gene for succinyl-CoA synthetase; involved in ATP production.
- sucB: a component of the 2-oxoglutarate dehydrogenase complex; catalyzes a step in the Krebs cycle.
- deoC: 2-deoxyribose-5-phosphate aldolase; binds selenium; may be involved in selenium transport.
See this history in Galaxy
If you want to see this Galaxy history without performing the steps above:
- Log in to Galaxy Australia: https://usegalaxy.org.au/
- Go to
Shared Data - Click
Histories - Click
Published-RNA-seq-bacteria - Click
Import (at the top right corner) - The analysis should now be showing as your current history.
More information
Here are some references covering more information about RNA-seq.
** A clear, up-to-date introduction to RNA-seq and testing for differential expression (pre-print): ** Berge, K Hembach, K Soneson, C Tiberi, S Clement, L Love, M Patro, R Robinson, M. RNA sequencing data: hitchhiker’s guide to expression analysis. PeerJPreprints. Available from: http://dx.doi.org/10.7287/peerj.preprints.27283
** Transcriptome assembly: ** Martin JA, Wang Z. Next-generation transcriptome assembly. Nat Rev Genet. 2011 Sep 7;12(10):671–82.
** SuperTranscripts: ** Davidson NM, Hawkins ADK, Oshlack A. SuperTranscripts: a data driven reference for analysis and visualisation of transcriptomes. Genome Biol. 2017 Aug 4;18(1):148.
** Splicing: ** Alamancos GP, Agirre E, Eyras E. Methods to study splicing from high-throughput RNA sequencing data. Methods Mol Biol. 2014;1126:357–97.
** Single-cell RNA-seq: ** McCarthy DJ, Campbell KR, Lun ATL, Wills QF. Scater: pre-processing, quality control, normalization and visualization of single-cell RNA-seq data in R. Bioinformatics. 2017 Apr 15;33(8):1179–86.
** Pathway analysis: ** Khatri P, Sirota M, Butte AJ. Ten years of pathway analysis: current approaches and outstanding challenges. PLoS Comput Biol. 2012 Feb 23;8(2):e1002375.
What’s next?
To use the tutorials on this website:
- ← see the list in the left hand panel
- ↖ or, click the menu button (three horizontal bars) in the top left of the page
You can find more tutorials at the Galaxy Training Network: