Metagenomics
Based on material by QFAB http.qfab.org
Tutorial Overview
In this tutorial we cover the concepts of metagenomics analysis in 16S rRNA.
What’s not covered:
This introductory tutorial covers the steps required for the first part of a 16S rRNA metagenomics study applied to samples from a single site. As such we only show analyses that can be applied to a single environment, that is, identifying what organisms are in your sample (diversity), and how many organisms there are in the sample (richness). Similar same steps would be applied if you had samples from multiple sites, with additional analyses that can compare the diversity and richness between samples. The latter part is not covered by this tutorial.
Background
16S rRNA sequencing
The application of culture-independent rRNA-based phylogenetics to the study of bacterial diversity was first explored by researchers in the 1980s e.g. [2] and was used by Carl Woese to delineate the main branches of life [1]. These rRNA based analyses remain central methods in microbiology both for the exploration of species diversity but also or the application in bacterial identification.
The development of pyrosequencing technologies and the emergence of next generation DNA sequencing have changed the scales at which scientists can systematically survey the 16S rRNA content of biological samples and has been a key driver behind initiatives such as the human microbiome project [5].
In the broadest sense, 16S rRNA sequencing should not be confused with metagenomics which has been defined as “the application of shotgun sequencing to DNA obtained directly from an environmental sample or series of related samples, producing at least 50Mbp of randomly sampled sequence data” [3] This is also in contrast to functional metagenomics whereby environmental DNA is first cloned and subsequently screened for specific functional activities e.g. [4]. Kunin et al elegantly argue that 16S rRNA profiling should in fact exist as a mandatory “Premetagenome community composition profiling” step in more comprehensive surveys [3]. There is discussion and debate as to how deep this sequencing could be performed but at JGI at least, a single 384 well plate is routinely surveyed.
Irrespective of the nuances of terminology, metagenomics is a baseline technology for understanding the ecology and evolution of microbial ecosystems upon which hypotheses and experimental strategies are built.
Experimental design is key
Most 16S strategies are biased towards the measurement of bacterial and archaeal constituents of the environment and eukaryotes (e.g. fungi and protists) are excluded. This compromises the ability to measure the whole ecological spectrum of a microbial community; the larger and non-coding and thus uninterpretable parts of the eukaryotic genome are problematic. This is largely driving strategies that encompass additional metatranscriptomics surveys. The studies are thus oriented towards the sequence-tractable bacterial, archaeal and viral components of the research community.
Limitations of the 16S rRNA profiling strategy
The 16S rRNA gene copy number of variable between bacterial species and this is further confounded by PCR amplification biases which skew estimates of the community composition.
Goals
The goal of the 16S study is to understand the complexity of the community being sampled. This is assessed as a function of the number of species in the community (richness) and their relative abundances (evenness).
The workflow in Galaxy
The process presented in this workshop is aimed to be an introduction for molecular biologists to 16S rRNA metagenomics analysis. The workflow is also implemented such that it can be run using a standard desktop or laptop environment without the requirements of a large cluster or cloud infrastructure.
The steps are:
De-replicate reads Cluster reads into Operational Taxonomic Units (OTU) Chimera detection and removal OTU Table Visualize taxa in a pie chart Phylogenetic Tree Rarefaction analysis Collector Curve analysis
Where is the data in this tutorial from?
We use the same data from the Mothur Marine community analysis, that is the study from the Global Ocean Sampling (GOS) Expedition. However we only concentrate on one site as opposed to the 14 sites they sampled.
Preparation
- Log in to your Galaxy instance (for example, Galaxy Australia, usegalaxy.org.au).
Use shared data
If you are using Galaxy Australia, you can import the data from a shared data library.
In the top menu bar, go to
- Click on
Data Libraries . - Click on
Galaxy Australia Training Material: Metagenomics: 16S Metagenomics . - Tick the boxes next to the three files.
- Click the
To History button, select As Datasets. - Name a new history and click
Import . - In the top menu bar, click
Analyze Data . - You should now have three files in your current history.
Or, import from the web
Only follow this step if unable to load the data files from shared data, as described above.
- In a new browser tab, go to this webpage:
- Right click on file name: select “copy link address”
- In Galaxy, go to
Get Data and then Upload File - Click
Paste/Fetch data - A box will appear: paste in link address
- Click
Start - Click
Close - The file will now appear in the top of your history panel.
- Repeat for the other two files.
Shorten file names
- Click on the pencil icon next to the file name
- In the centre Galaxy panel, click in the box under
Name - Shorten the file name
- Then click
Save
Keep only the unique reads
The aim here is to reduce the number of reads in the input file into clusters of identical sequences.
De-replicate
In the tool panel (left hand side) search for the
- For
Input sequence file select theseqs.fasta dataset - For
Criteria used for duplicate detection select Full length
There are two possible modes in the dereplication tool: (1) full-length where the full length of the two sequences must be identical before being grouped together or (2) prefix where the first section of the two sequences are identical
- Click on
Execute
Output
Click on the name of the job
Examine the output by clicking on the eye icon in the top right hand corner of this step.
-
NOTE: The size=XXXX annotation in the id line of the fasta file indicates the number of reads that were exactly the same as this representative sequence.
-
(OPTIONAL) You can rename the output by clicking on the pencil icon on the top right hand corner of the output. Click on the “Attributes tab” and then type in a new name in the “Name” field and click on Save.
Cluster sequences into groups
The aim here is to cluster the output from Step 1 into groups of sequences that are 97% similar to each other.
- Similar sequences are assumed to belong to the same taxonomic groups. * These groups are known as Operational Taxonomic Units (OTUs) and the number of clusters that result are indicative of the number of species that are present in this sample.
Cluter OTUs
In the tool panel on the left hand side, search for the
- For
Input sequence file , select the output from Step 1:Dereplicate on data XX - For
Minimum cluster radius , enter 0.97. This value corresponds to 97% sequence identity. - Click on
Execute
Output
Click on the name of the job
- Examine the output by clicking on the eye icon in the top right hand corner of this step.
- The output file is in Fasta format where each sequence is the representative of an OTU.
- Figure 2 below, provides a summary of the clustering approach. Reads are represented by solid dots. Those that are 97% similar to each other are grouped together into an OTU as represented by the blue dashed circles.
- In this example, there are 3 OTUs. The reads that represents the centroid of each cluster are shown as red dots and these red dots form the representative of the corresponding OTU. These are the sequences that are reported in the output file. In this tutorial, the output file has 559 sequences, meaning that 559 OTUs were identified.
- NOTE: The number of sequences that you have may not be exactly the same as mentioned here. This is so as the clustering process starts by randomly selecting a read as the center of a cluster (see figure below). Since this is a random selection, each time you perform a clustering process, a different read can be selected and thus affect the end result.
For more detailed description of the OTU clustering algorithm, click here.
Figure 2.
Please Note: The “size=XX” annotation in the identifier line is NOT the number of sequences that is in the corresponding cluster (OTU). This value is the same as the input file.
Remove chimeras
The aim here is to: identify and filter chimeric sequences utilizing a reference dataset containing nucleotide sequences believed to be free of chimeras.
Uchime
In the tool panel on the left hand side, search for the
- For
Mode to detect chimeras , select “ref” (reference mode).
There are two possible modes in the Uchime tool for chimera detection: (1) ref where a reference dataset of parent sequences believed to be chimera-free is provided by the user or (2) de-novo where a database is constructed from the query sequences by using the abundance data.
- For the second field
Input reference file select theCluster OTU on data XX - For the third field
Reference Database , select thegold.fasta . This file (downloaded from http://drive5.com/uchime/gold.fa) contains nucleotide sequences believed to be free of chimeras. - Click on
Execute
Output
-
Click on
Uchime on data XX and data XX:chimeras - These sequences have been classified to be chimeras
-
Click on
Uchime on data XX and data XX:non chimeras - These sequences have been classified to be non-chimeras
- Only the “non-chimeras” output be used for the subsequent steps.
- All other sequences, chimeras and nonclassified will be disregarded.
There are 2 hidden log files also generated from this tool, they contain detailed job information about the classification into non-chimeras and chimeras.
(OPTIONAL) To access the hidden outputs, click on the cog wheel in the upper right corner of the History panel and select “Include Hidden Datasets”.
-
Click on
Uchime on data XX and data XX:Human-readable output - The chimeric alignments (with respect to the putative parents) are listed following the human-readable format of the USEARCH-Tool-Suite.
-
Click on
Uchime on data XX and data XX:Tabbed output - For each sequence there are 18 values logged in a tabbed format, as following:
NOTE: The number of sequences that you have may not be exactly the same due to the results of the OTU Clustering step in Section 2.
How many reads per OTU cluster?
The aim here is to generate a full OTU Table which consists of the number of reads found for each group, the representative sequence and the predicted taxonomy.
- This is accomplished by running two tools and merging the resulting outputs together.
-
The first tool “Map Reads to OTU”, as the name indicates will map read sequences to the OTU cluster representatives from Section 2.
- In section 2, we clustered the reads into clusters that are 97% similar to all other members of that cluster. Then for each cluster, a representative sequence was selected. However, we do not know the number of reads that are in each cluster. Furthermore, in Section 3 we removed sequences that were potential chimeric sequences; this alters the number of remaining clusters.
- Therefore, in this step we first find out which of the remaining clusters each read belongs with 97% similarity and this will give us a count of the number of reads per cluster.
-
The second tool “RDP Multi Classifier” will assign each the OTU representative sequence to a taxonomy level.
-
The last step then merges the information to generate the OTU Table
Map reads to OTU - how many reads per OTU?
In the tool panel on the left hand side, search for the
- For
OTU fasta file select datasetUchime on data XX and data XX: non_chimeras - For
Input reads file select datasetDereplicate on data XX - For
Minimum identity enter 0.97. Like before, this value corresponds to 97% sequence similarity. - Click on
Execute
Output
-
Click on
Map Reads to OTU on data XX and data XX:rabund - The output is tab delimited and follows the ‘rabund’ format from mothur.
- Note that this is just a one-line file which wraps around in the Galaxy display window to look like a multi-line file.
- In other words, this is a matrix of one row only, and many columns.
- The first column is a label (this is usually the identity)
- The second column is the number of OTUs found
- Each subsequent column is the number of reads found for the corresponding OTUs.
-
Click on
Map Reads to OTU on data XX and data XX:Pre-OTU Table - This is the pre OTU table with the following columns: OTU label, Count, Sequence
-
Click on
Map Reads to OTU on data XX and data XX:relabelled OTU - This is a FASTA file.
- The input reads from
Uchime on data XX and data XX: non_chimeras have been re-labelled with convenient labels for parsing, e.g. OTU_1, OTU_2,…OTU_N, where N is the number of OTUs.
There is one hidden log file also generated from this tool, which reports whether a query sequence matches or not matches the database.
(OPTIONAL) To access the hidden output, click on the cog wheel in the upper right corner of the History panel and select “Include Hidden Datasets”.
-
Click on
Map Reads to OTU on data XX and data XX:hit list - This file reports query sequences that match or did not match the database
Cluster OTU on data XX following the UC-format of the USEARCH-Tool-Suite. - Note that a given read may match two or more OTUs given the identity threshold. In such cases the tool will tend to assign the read to the OTU with highest identity and will break ties arbitrarily.
- Some reads may not match any OTU for these reasons: (1) the read is chimeric, (2) the read has more than 3% errors, (3) the read has a singleton sequence so was discarded.
- This file reports query sequences that match or did not match the database
RDP Classifier - what taxa match the OTUs?
A complete OTU table contains the taxonomy assignment of each sequence. To complete the Pre-OTU Table, we will use the
In the tool panel on the left hand side, search for the
- For
Select Gene Trainings Model select “16S rRNA” from the dropdown list. - For
Select to generate an OTU Table click Yes. - For
Relabelled OTU input reads of the ‘Map Reads to OTU’ tool in FASTA format chooseMap Reads to OTU on data XX and data XX:relabelled OTU - For
PRE OTU table of the ‘Map Reads to OTU’ tool selectMap Reads to OTU on data XX and data XX:Pre-OTU Table -
For
Assignment confidence cutoff , enter 0.8.- This value is used to specify the assignment confidence cutoff used to determine the assignment count in the hierarchical format. The range is between 0 and 1 with 0.8 being the default value. For sequences that are shorter than 250 base pairs, the tool developers recommend using 0.5 as the confidence threshold. (http://rdp.cme.msu.edu/tutorials/classifier/RDPtutorial_MULTICLASSIFIER.html)
-
For
Tab delimited output format , select “fixrank“ from the drop-down list - Click on
Execute
Output
-
Click on
RDP MultiClassifier on data XX and data XX:classification_assignment_hierarchical.tab - Sequence count for each taxon in the hierarchy
-
Click on
RDP MultiClassifier on data XX and data XX:classification_assignment_details.tab - Sequence-by-sequence classification results including confidence scores at each level of the hierarchy
-
Click on
RDP MultiClassifier on data XX and data XX:OTU_Table.tab - The OTU table is composed of the following columns: OTU label, Count, Sequence, Phylogenetic lineage.
View taxa with Krona
Krona is a tool that shows you a pie chart of the taxa found in your sample.
Tidy the output
Cut some columns
Search for
Cut columns : c2,c4,c7,c10,c13,c16,c19Delimited by : TabFrom : theRDP MultiClassifier OTU_Table.tab Execute
Take out the quotation marks
Search for
- For
Select lines from : the previous file,Cut on data XX - Click
Insert Check Find Regex : “Replacement : delete any text in here and leave emptyExecute
Krona
Search for
- For
Select input file : the previous file,Regex Find And Replace on data XX Is this output from MOTHUR : NoExecute
Click on the eye icon next to the output file.
- This is a pie chart showing the abundance of taxa. The higher taxonomic levels are in the centre, and the lower taxonomic levels are in the outer circle.
- Click on a taxon to display the information in more detail.
- Most of our sample is classified as Bacteria.
- A particularly abundant genus is Martelelella.
Phylogeny
How are our taxa related phylogenetically? We will run two tools to investigate.
PyNAST
The aim here is to build and visualise the phylogenetic tree.
- First, we align the sequences to a template sequence using PyNAST.
- PyNast is a reimplementation of the NAST algorithm to align each provided sequence (the “candidate” sequence) to the best-matching sequence in a pre-aligned database of sequences (the “template” sequence).
In the tool panel on the left hand side, search for the
- For
Candidate file (Fasta format) selectUchime on data XX and data XX:non_chimeras - For
Template alignment file (Fasta format) , selectcore_set_aligned.fasta -
For
Minimum sequence length , enter 1000.- The minimum sequence length is the length to include in the NAST alignment and it has to be set according to your set of your sequences.
-
Click
Execute
Outputs:
-
Click on
PyNAST on data XX and data XX: pynast aligned - This file contains the alignment. You will find that the number of sequences might have decreased. This means that the alignment for some sequences failed. More information is given in the log file and the failure file, which are hidden outputs.
(OPTIONAL) To access the hidden outputs, click on the cog wheel in the upper right corner of the History panel and select “Include Hidden Datasets”.
- Click on
PyNAST on data XX and data XX: pynast log file
This file is composed of the following columns:
- candidate sequence ID
- candidate nucleotide count
- errors
- template ID
- BLAST percent identify to template
- candidate nucleotide count post-NAST
You will observe that some sequences have an entry in the error column: “No search results”. These are the sequences that failed to align.
-
Click on
PyNAST on data XX and data XX: pynast failure - This file contains the sequences that failed to align. We already found out through the log file which sequences failed to align; here those sequences are given in FASTA format.
FastTree
Next we use “Fast Tree” to generate the phylogenetic tree using the aligned sequenced file.
In the tool panel on the left hand side, search for the
- For
Aligned sequences file (FASTA format) , selectPyNAST on data XX and data XX:pynast aligned - For
Protein or nucleotide alignment select “Nucleotide” - Click
Execute
Output
-
Click on
FastTree on data 18:tree.nhx - To visualize the phylogenetic tree, click on the “Visualize” icon
under the file name.
- Click on
Phyloviz - Phyloviz is the interactive Phylogenetic Tree Visualizer included from Galaxy. Currently Phyloviz has only one entry point, which is via the “Visualize - View in Phyloviz” icon that will appear for all supported phylogenetic data sets.
- Phyloviz will lay out the tree in a linear format. All phylogenetic distances are normalized to a value between 0 to 1.0 inclusive, and a default value of 250px would represent 1.0 units from a parent node to child node. The laying out of phyloviz is “in-place” meaning that the tree is presented in the same order as it is represented in the data and that no nodes are shuffled.
- Phyloviz supports a number of user interaction features to help you present, analyze and share the visualization. Please see the Galaxy wiki for more information on Phyloviz.
- To visualize the phylogenetic tree, click on the “Visualize” icon
(OPTIONAL) To access the hidden log file, click on the cog wheel in the upper right corner of the History panel and select “Include Hidden Datasets”.
-
Click on
FastTree on data XX:log_FastTree_run.txt - This is the log file from Fast tree and shows job related information. It is not required for the analysis workflow.
Was sampling sufficient? Rarefaction analysis
The aim here is to generate an intra-sample rarefaction curve to determine if the sampling effort was sufficient.
- Rarefaction curves provide a way of comparing the richness observed in different samples although it is a better measure of sample diversity.
- Briefly, it plots the average number of OTUs that you would expect to observe against the number of sequences.
- If the curve plateaus off, then it is a good indication that the sample is saturated and you have captured the majority (if not all) of species/strains present in the sample.
- However, if the curve does not plateau off, this suggests there may still be more species/strains that have not been captured in your data and you need increase the size of your sample.
Rarefaction tool
In the tool panel on the left hand side, search for the
- For
Input file format select rabund from the dropdown list - For
OTU list (rabund or sabund format) , selectMap Reads to OTU on data XX and data XX:rabund -
For
Minimum identity used in OTU clustering , enter the minimum identity used when you generated the OTU Table. In this tutorial we used 0.97.- In case you can’t remember the minimum identity used, go back to the output file
Cluster OTU on data XX in your History. Click on the file name to extend the dataset information. Click on the “View details” icon. In the Input Parameter section you will find the “Minimum cluster radius” you used in the OTU Clustering step.
- In case you can’t remember the minimum identity used, go back to the output file
-
Click
Execute
Output
-
Click on
Rarefaction on data XX .- The output is a tabbed delimited file with the following columns:
- The first column (numsampled): the number sampled and indicates the level of sampling intensity (by default this information is provided every 100 individuals)
- The second column (0.03) is the average number of OTUs that were observed for that sampling intensity based on the number of iterations, which is 1000 by default. This would represents the data for the y-axis for the corresponding minimum identity (1-0.97=0.03).
- The third column (lci) and fourth column (hci) represent the confidence intervals.
To plot the rarefaction curve:
- Click on
Rarefaction on data XX to extend the dataset information. - Click on the “Visualize” icon
- Select “Scatterplot”
- Under Column 1, highlight “X Column” (the number sampled)
- Under Column 2, highlight “Y Column” (the average number of OTUs that were observed for that sampling intensity based on the number of iterations, which is 1,000 by default).
- Click Draw
Estimate diversity
The aim here is to estimate the diversity and richness of the sample.
- Collector’s curves describe how the richness or diversity change as you add more samples into the dataset.
- If a collector’s curve plateaus, it is a good indication that enough samples were collected and you can have confidence in the last value of the curve.
Collector’s curve
In the tool panel on the left hand side, search for the
- For
OTU list (rabund, sabund, list or shared format) , selectMap Reads to OTU on data XX and data XX:rabund -
For
Labels - OTU labels , ClickSelect and make sure that0.03 appears in the box.- This correspondd to 97%. Note that this value is the inverse of other similarity thresholds we used thus far (1-0.97=0.03) and is the default value.
-
For
Calculators , there is no need to change anything from the default calculators:- chao - Community richness - the Chao estimator
- invsimpson - Community diversity - the Simpson index
- npshannon - Community diversity - the non-parametric Shannon index.
-
Click
Execute
Output
-
Click on
Collector Curve on data XX:tab - This file provides a summary containing the following fields: number of sequences, the sample coverage, the number of observed OTUs, and then a summary of each of the selected calculators that were applied to the sample.
-
Click on
Collector Curve on data XX:tab (invsimpson) - The Simpson’s Inverse Index also referred to as Simpson’s Reciprocal Index is a measure of diversity.
- Species diversity is a way to compare samples that incorporates both a measure of evenness (representation by each species) and richness (number of each species in the sample).
- This index starts with the value of 1 as the lowest possible figure. This figure would represent a community containing only one species. The higher the value, the greater the diversity. The maximum value is the number of species in the sample.
-
Click on
Collector Curve on data XX:tab (npshannon) - The non-parametric Shannon Index, like the Simpson’s Inverse Index, is a diversity estimator and calculates a non-parametric estimate of the classical Shannon diversity Index for an OTU definition.
- The interpretation of the curve has to be performed in a comparison to the diversity curve of one or more communities.
-
Click on
Collector Curve on data XX:tab (chao) -
The Chao1 estimator is a species richness estimator. The species richness estimators estimate the total number of species present in a community. The Chao1 index estimator the species richness for one community based on the observed species frequencies or presence/absence data.
-
If you compare the rarefaction curve (observed richness) and Chao1 estimator curve (estimated richness), you will observe a gap. The greater estimated richness in comparison to the observed richness usually indicates that with further sampling, the richness would likely continue to increase. Please note that if a sample contains many singletons, it is likely that more undetected OTUs exist, and the Chao1 index will estimate greater species richness than it would for a sample without rare OTUs.
-
These three outputs are tabbed delimited with the following columns:
-
First column (numsampled): is the number sampled and indicates the level of sampling intensity (by default this information is provided every 100 individuals)
-
The second column (0.03) is the average number of OTUs that were observed for that sampling intensity based on the number of iterations, which is 1000 by default. This would represents the data for the y-axis for the corresponding minimum identity (1-0.97=0.03).
-
The third column (lci) and fourth column (hci) represent the confidence intervals
These files can be plotted similar to the rarefaction curve in the previous step.
Plot the collector’s curves
- Click on
Collector Curve on data XX:tab (invsimpson) to extend the dataset information. - Click on the now accessible “Visualize” icon
-
Select “Scatterplot”
-
Under Column 1, highlight “X Column” (the number sampled)
- Under Column 2, highlight “Y Column” (the average number of OTUs that were observed for that sampling intensity based on the number of iterations, which is 1,000 by default).
- Click
Draw
Repeat to plot the curves from the other calculators.
The workflow canvas
References
-
Woese CR, Kandler O, Wheelis ML (1990) Towards a natural system of organisms: proposal for the domains Archaea, Bacteria, and Eucarya. Proc Natl Acad Sci U S A 87: 4576-4579.
-
Lane DJ, Pace B, Olsen GJ, Stahl DA, Sogin ML, et al. (1985) Rapid determination of 16S ribosomal RNA sequences for phylogenetic analyses. Proc Natl Acad Sci U S A 82: 6955-6959.
-
Kunin V, Copeland A, Lapidus A, Mavromatis K, Hugenholtz P (2008) A bioinformatician’s guide to metagenomics. Microbiol Mol Biol Rev 72: 557-578.
-
Handelsman J, Rondon MR, Brady SF, Clardy J, Goodman RM (1998) Molecular biological access to the chemistry of unknown soil microbes: a new frontier for natural products. Chem Biol 5: R245-249.
-
Turnbaugh PJ, Ley RE, Hamady M, Fraser-Liggett CM, Knight R, et al. (2007) The human microbiome project. Nature 449: 804-810.
-
Holmes S, Alekseyenko A, Timme A, Nelson T, Pasricha PJ, et al. (2011) Visualization and statistical comparisons of microbial communities using R packages on phylochip data. Pac Symp Biocomput: 142-153.
-
Hamady M, Lozupone C, Knight R (2010) Fast UniFrac: facilitating high-throughput phylogenetic analyses of microbial communities including analysis of pyrosequencing and PhyloChip data. ISME J 4: 17-27.
-
Brodie EL, Desantis TZ, Joyner DC, Baek SM, Larsen JT, et al. (2006) Application of a high-density oligonucleotide microarray approach to study bacterial population dynamics during uranium reduction and reoxidation. Appl Environ Microbiol 72: 6288-6298.