The curatedMetagenomicData package provides taxonomic, functional, and gene marker abundance for samples collected from different bodysites. It provides data from aproximately 3000 human microbiome samples that have been highly processed, refined, and curated such that analysis that might otherwise require a computing cluster can be done on an ordianary laptop.
curatedMetagenomicData
providescuratedMetagenomicData
provides 6 types of data for each dataset:
Types 1-3 are generated by MetaPhlAn2; 4-6 are generated by HUMAnN2.
Currently, curatedMetagenomicData
provides:
These datasets are documented in the reference manual.
curatedMetagenomicData
ResourcesUse of the resources in curatedMetagenomicData
is simplified with the use of Bioconductor’s ExperimentHub platform, which allows for the accessing of data through an intuitive interface. First, curatedMetagenomicData
is installed using BiocInstaller
and then called as a library - the process allows for the user to simply call datasets as functions because the package is aware of the resources present in ExperimentHub S3 buckets.
# BiocInstaller::biocLite("curatedMetagenomicData") #Bioconductor version
# BiocInstaller::biocLite("vobencha/curatedMetagenomicData") #bleeding edge version
library(curatedMetagenomicData)
In the next section we will demonstrate convenience functions that make ExperimentHub transparent, but it can be useful and powerful to interact directly with ExperimentHub. A “hub” connects you to the ExperimentHub server and its metadata, without downloading any data:
suppressPackageStartupMessages(library(ExperimentHub))
eh = ExperimentHub()
## snapshotDate(): 2016-10-01
The following queries ExperimentHub for all records matching “curatedMetagenomicData”:
myquery = query(eh, "curatedMetagenomicData")
This is an abbreviated list of what was found:
myquery
## ExperimentHub with 186 records
## # snapshotDate(): 2016-10-01
## # $dataprovider: Department of Psychology, Abdul Haq Campus, Federal Ur...
## # $species: Homo Sapiens
## # $rdataclass: ExpressionSet
## # additional mcols(): taxonomyid, genome, description,
## # coordinate_1_based, maintainer, rdatadateadded, preparerclass,
## # tags, sourceurl, sourcetype
## # retrieve records with, e.g., 'object[["EH178"]]'
##
## title
## EH178 | HMP_2012.genefamilies_relab.anterior_nares
## EH179 | HMP_2012.genefamilies_relab.buccal_mucosa
## EH180 | HMP_2012.genefamilies_relab.hard_palate
## EH181 | HMP_2012.genefamilies_relab.keratinized_gingiva
## EH182 | HMP_2012.genefamilies_relab.l_retroauricular_crease
## ... ...
## EH359 | ZellerG_2014.marker_abundance.stool
## EH360 | ZellerG_2014.marker_presence.stool
## EH361 | ZellerG_2014.metaphlan_bugs_list.stool
## EH362 | ZellerG_2014.pathabundance_relab.stool
## EH363 | ZellerG_2014.pathcoverage.stool
Note that the first column, “EH178” etc, are unique IDs for each dataset. For a full list, mcols(myquery)
produces a data.frame
. The following is unevaluated in this script, but can be used to display an interactive spreadsheet of the results:
View(mcols(myquery))
Or to write the search results to a csv file:
write.csv(mcols(myquery), file="curatedMetagenomicData_allrecords.csv")
Tags can (eventually) help identify useful datasets:
head(myquery$tags)
## [1] "Homo_sapiens_Data, ReproducibleResearch, MicrobiomeData"
## [2] "Homo_sapiens_Data, ReproducibleResearch, MicrobiomeData"
## [3] "Homo_sapiens_Data, ReproducibleResearch, MicrobiomeData"
## [4] "Homo_sapiens_Data, ReproducibleResearch, MicrobiomeData"
## [5] "Homo_sapiens_Data, ReproducibleResearch, MicrobiomeData"
## [6] "Homo_sapiens_Data, ReproducibleResearch, MicrobiomeData"
Titles can be used to select body site and data type. Here are the MetaPhlan2 taxonomy tables for all the available Human Microbiome Project (HMP) body sites, found using a Perl regular expression to find “HMP” at the beginning of a string, followed by any number of characters “.*“, followed by the word”metaphlan“:
grep("(?=HMP)(?=.*metaphlan)", myquery$title, perl=TRUE, val=TRUE)
## [1] "HMP_2012.metaphlan_bugs_list.anterior_nares"
## [2] "HMP_2012.metaphlan_bugs_list.buccal_mucosa"
## [3] "HMP_2012.metaphlan_bugs_list.hard_palate"
## [4] "HMP_2012.metaphlan_bugs_list.keratinized_gingiva"
## [5] "HMP_2012.metaphlan_bugs_list.l_retroauricular_crease"
## [6] "HMP_2012.metaphlan_bugs_list.mid_vagina"
## [7] "HMP_2012.metaphlan_bugs_list.palatine_tonsils"
## [8] "HMP_2012.metaphlan_bugs_list.posterior_fornix"
## [9] "HMP_2012.metaphlan_bugs_list.r_retroauricular_crease"
## [10] "HMP_2012.metaphlan_bugs_list.saliva"
## [11] "HMP_2012.metaphlan_bugs_list.stool"
## [12] "HMP_2012.metaphlan_bugs_list.subgingival_plaque"
## [13] "HMP_2012.metaphlan_bugs_list.supragingival_plaque"
## [14] "HMP_2012.metaphlan_bugs_list.throat"
## [15] "HMP_2012.metaphlan_bugs_list.tongue_dorsum"
## [16] "HMP_2012.metaphlan_bugs_list.vaginal_introitus"
Here are all the metabolic pathway abundance for the stool body site:
grep("(?=.*stool)(?=.*metaphlan)", myquery$title, perl=TRUE, val=TRUE)
## [1] "HMP_2012.metaphlan_bugs_list.stool"
## [2] "KarlssonFH_2013.metaphlan_bugs_list.stool"
## [3] "LeChatelierE_2013.metaphlan_bugs_list.stool"
## [4] "LomanNJ_2013_Hi.metaphlan_bugs_list.stool"
## [5] "LomanNJ_2013_Mi.metaphlan_bugs_list.stool"
## [6] "NielsenHB_2014.metaphlan_bugs_list.stool"
## [7] "Obregon_TitoAJ_2015.metaphlan_bugs_list.stool"
## [8] "QinJ_2012.metaphlan_bugs_list.stool"
## [9] "QinN_2014.metaphlan_bugs_list.stool"
## [10] "RampelliS_2015.metaphlan_bugs_list.stool"
## [11] "ZellerG_2014.metaphlan_bugs_list.stool"
Or, here are all the data products available for the HiSeq component of the “Loman” study:
(lomannames = grep("LomanNJ_2013_Hi", myquery$title, perl=TRUE, val=TRUE))
## [1] "LomanNJ_2013_Hi.genefamilies_relab.stool"
## [2] "LomanNJ_2013_Hi.marker_abundance.stool"
## [3] "LomanNJ_2013_Hi.marker_presence.stool"
## [4] "LomanNJ_2013_Hi.metaphlan_bugs_list.stool"
## [5] "LomanNJ_2013_Hi.pathabundance_relab.stool"
## [6] "LomanNJ_2013_Hi.pathcoverage.stool"
We could create a list of ExpressionSet
objects containing all of the Loman HiSeq products as follows (not evaluated):
(idx = grep("LomanNJ_2013_Hi", myquery$title, perl=TRUE))
loman.list = lapply(idx, function(i){
return(myquery[[i]])
})
names(loman.list) = lomannames
loman.list
Individual data projects can be fetched without having to think about ExperimentHub. A function call to a dataset name returns a Bioconductor ExpressionSet object:
suppressPackageStartupMessages(library(curatedMetagenomicData))
loman.eset = LomanNJ_2013_Hi.metaphlan_bugs_list.stool()
## snapshotDate(): 2016-10-01
## see ?curatedMetagenomicData and browseVignettes('curatedMetagenomicData') for documentation
## loading from cache '/home/biocbuild//.ExperimentHub/289'
ExpressionSet
ObjectsAll datasets are represented as ExpressionSet
objects because of the integrative nature of the class and its ability to bind data and metadata. There are three main functions, from the Biobase
package, that provide access to experiment-level metadata, subject-level metadata, and the data itself.
To access the experiment-level metadata the experimentData()
function is used to return a MIAME
(Minimum Information About a Microarray Experiment) object.
experimentData( loman.eset )
## Experiment data
## Experimenter name: Loman NJ, Constantinidou C, Christner M, Rohde H, Chan JZ, Quick J, Weir JC, Quince C, Smith GP, Betley JR, Aepfelbacher M, Pallen MJ
## Laboratory: Institute of Microbiology and Infection, University of Birmingham, Birmingham, England.
## Contact information:
## Title: A culture-independent sequence-based metagenomics approach to the investigation of an outbreak of Shiga-toxigenic Escherichia coli O104:H4.
## URL:
## PMIDs: 23571589
##
## Abstract: A 308 word abstract is available. Use 'abstract' method.
## notes:
## Sequencing technology:
## Illumina
To access the subject-level metadata the pData()
function is used to return a data.frame
containing subject-level variables of study.
head( pData( loman.eset ) )
## subjectID first repeat stooltexture daysafteronset hus
## OBK1122 OBK1122 1122 0 <NA> NA <NA>
## OBK1196 OBK1196 1196 0 <NA> NA <NA>
## OBK1253 OBK1253 1253 0 <NA> NA <NA>
## OBK2535 OBK2535 2535 0 smooth 3 y
## OBK2535b <NA> NA NA <NA> NA <NA>
## OBK2638 OBK2638 2638 0 bloody 5 y
## stec_count shigatoxin2elisa readsmillions nonhuman stec_coverage
## OBK1122 <NA> <NA> 93.0 46 <NA>
## OBK1196 <NA> <NA> 73.6 >99 <NA>
## OBK1253 <NA> <NA> 82.4 >99 <NA>
## OBK2535 moderate positive 55.0 >99 619
## OBK2535b <NA> <NA> NA <NA> <NA>
## OBK2638 high positive 30.8 32 29
## stxab_detected stx_ratio typingdata c_difficile_frequency
## OBK1122 y NA <NA> positive
## OBK1196 n NA <NA> negative
## OBK1253 n NA <NA> positive
## OBK2535 y 5 y 0.0014
## OBK2535b <NA> NA <NA> <NA>
## OBK2638 y 2 y negative
## disease bodysite country number_reads
## OBK1122 stec2-positive stool germany 464060
## OBK1196 stec2-positive stool germany 30181380
## OBK1253 stec2-positive stool germany 65704818
## OBK2535 stec2-positive stool germany 43597736
## OBK2535b stec2-positive stool germany 22253314
## OBK2638 stec2-positive stool germany 1105492
To access the data itself (in this case relative abundance), the exprs()
function returns a variables by samples (rows by columns) numeric matrix. Note the presence of “synthetic” clades at all levels of the taxonomy, starting with kingdom, e.g. k__bacteria here:
exprs( loman.eset )[1:6, 1:5] #first 6 rows and 5 columns
## OBK1122 OBK1196 OBK1253
## k__Bacteria 100.00000 100.00000 100.00000
## k__Bacteria|p__Bacteroidetes 83.55662 30.41764 79.41203
## k__Bacteria|p__Firmicutes 15.14819 68.30191 19.28784
## k__Bacteria|p__Proteobacteria 1.29520 0.06303 0.97218
## k__Bacteria|p__Bacteroidetes|c__Bacteroidia 83.55662 30.41764 79.41203
## k__Bacteria|p__Firmicutes|c__Clostridia 14.93141 33.22720 11.26228
## OBK2535 OBK2535b
## k__Bacteria 99.39953 99.63084
## k__Bacteria|p__Bacteroidetes 15.14525 7.03101
## k__Bacteria|p__Firmicutes 25.71252 45.17325
## k__Bacteria|p__Proteobacteria 58.53298 47.29051
## k__Bacteria|p__Bacteroidetes|c__Bacteroidia 15.14525 7.03101
## k__Bacteria|p__Firmicutes|c__Clostridia 16.50313 13.97996
Bioconductor provides further documentation of the ExpressionSet class and has published an excellent introduction.
Here’s a direct, exploratory analysis of E. coli prevalence in the Loman dataset using the ExpressionSet
object. More elegant solutions will be provided later using subsetting methods provided by the phyloseq package, but for users familiar with grep()
and the ExpressionSet
object, such manual methods may suffice.
First, which E. coli-related taxa are available?
grep("coli", rownames(loman.eset), value=TRUE)
## [1] "k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacteriales|f__Enterobacteriaceae|g__Escherichia|s__Escherichia_coli"
## [2] "k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacteriales|f__Enterobacteriaceae|g__Escherichia|s__Escherichia_coli|t__Escherichia_coli_unclassified"
## [3] "k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Ruminococcaceae|g__Anaerotruncus|s__Anaerotruncus_colihominis"
## [4] "k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Ruminococcaceae|g__Anaerotruncus|s__Anaerotruncus_colihominis|t__GCF_000154565"
Create a vector of E. coli relative abundances. This grep
call with a “$” at the end selects the only row that ends with “s__Escherichia_coli“:
x = exprs( loman.eset )[grep("s__Escherichia_coli$", rownames( loman.eset)), ]
summary( x )
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.5484 2.3060 13.3800 11.6900 90.4500
This could be plotted as a histogram:
hist( x, xlab = "Relative Abundance", main="Prevalence of E. Coli",
breaks="FD")
phyloseq
For the MetaPhlAn2 bugs datasets (but not other data types), you gain a lot of taxonomy-aware, ecological analysis and plotting by conversion to a phyloseq class object. curatedMetagenomicData provides the ExpressionSet2phyloseq()
function to make this easy:
loman.pseq = ExpressionSet2phyloseq( loman.eset )
Absolute raw count data can be estimated from the relative count data by multiplying the columns of the ExpressionSet
data by the number of reads for each sample, as found in the pData
column “number_reads”. You could do this manually using the sweep()
function:
loman.counts = sweep(exprs( loman.eset ), 2, loman.eset$number_reads, "*")
loman.counts[1:6, 1:5]
## OBK1122 OBK1196
## k__Bacteria 46406000.0 3018138000
## k__Bacteria|p__Bacteroidetes 38775285.1 918046352
## k__Bacteria|p__Firmicutes 7029669.1 2061445900
## k__Bacteria|p__Proteobacteria 601050.5 1902332
## k__Bacteria|p__Bacteroidetes|c__Bacteroidia 38775285.1 918046352
## k__Bacteria|p__Firmicutes|c__Clostridia 6929070.1 1002842750
## OBK1253 OBK2535
## k__Bacteria 6570481800 4333594467
## k__Bacteria|p__Bacteroidetes 5217752978 660298611
## k__Bacteria|p__Firmicutes 1267304017 1121007659
## k__Bacteria|p__Proteobacteria 63876910 2551905409
## k__Bacteria|p__Bacteroidetes|c__Bacteroidia 5217752978 660298611
## k__Bacteria|p__Firmicutes|c__Clostridia 739986058 719499105
## OBK2535b
## k__Bacteria 2217116367
## k__Bacteria|p__Bacteroidetes 156463273
## k__Bacteria|p__Firmicutes 1005254517
## k__Bacteria|p__Proteobacteria 1052370568
## k__Bacteria|p__Bacteroidetes|c__Bacteroidia 156463273
## k__Bacteria|p__Firmicutes|c__Clostridia 311100440
or just set the relab
argument in Expressionset2phyloseq()
to FALSE
:
loman.pseq2 = ExpressionSet2phyloseq( loman.eset, relab=FALSE )
otu_table( loman.pseq2 )[1:6, 1:5]
## OTU Table: [6 taxa and 5 samples]
## taxa are rows
## OBK1122 OBK1196 OBK1253 OBK2535 OBK2535b
## k__Bacteria 46406000 3018138000 6570481800 4333594467 2217116367
## p__Bacteroidetes 38775285 918046352 5217752978 660298611 156463273
## p__Firmicutes 7029669 2061445900 1267304017 1121007659 1005254517
## p__Proteobacteria 601051 1902332 63876910 2551905409 1052370568
## c__Bacteroidia 38775285 918046352 5217752978 660298611 156463273
## c__Clostridia 6929070 1002842750 739986058 719499105 311100440
Note the simplified row names of the OTU table, resulting from the default argument simplify=TRUE
. The simplified names are convenient, and lossless because taxonomic information is now attainable by tax_table(loman.pseq2)
The phyloseq objects currently contain 3 components, with extractor functions hinted at by its show method:
loman.pseq
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 736 taxa and 44 samples ]
## sample_data() Sample Data: [ 44 samples by 19 sample variables ]
## tax_table() Taxonomy Table: [ 736 taxa by 8 taxonomic ranks ]
otu_table()
returns the same thing as exprs( loman.eset )
did, the Operational Taxanomic Unit (OTU) table. Here are the first 6 rows and 5 columns:
otu_table( loman.pseq )[1:6, 1:5]
## OTU Table: [6 taxa and 5 samples]
## taxa are rows
## OBK1122 OBK1196 OBK1253 OBK2535 OBK2535b
## k__Bacteria 100.00000 100.00000 100.00000 99.39953 99.63084
## p__Bacteroidetes 83.55662 30.41764 79.41203 15.14525 7.03101
## p__Firmicutes 15.14819 68.30191 19.28784 25.71252 45.17325
## p__Proteobacteria 1.29520 0.06303 0.97218 58.53298 47.29051
## c__Bacteroidia 83.55662 30.41764 79.41203 15.14525 7.03101
## c__Clostridia 14.93141 33.22720 11.26228 16.50313 13.97996
The same patient or participant data that was available from pData( loman.eset )
is now availble using sample_data()
on the phyloseq object:
sample_data( loman.pseq )[1:6, 1:5]
## subjectID first repeat. stooltexture daysafteronset
## OBK1122 OBK1122 1122 0 <NA> NA
## OBK1196 OBK1196 1196 0 <NA> NA
## OBK1253 OBK1253 1253 0 <NA> NA
## OBK2535 OBK2535 2535 0 smooth 3
## OBK2535b <NA> NA NA <NA> NA
## OBK2638 OBK2638 2638 0 bloody 5
But this object also is aware of the taxonomic structure, which will enable the powerful subsetting methods of the phyloseq package.
head( tax_table( loman.pseq ) )
## Taxonomy Table: [6 taxa by 8 taxonomic ranks]:
## Kingdom Phylum Class Order Family
## k__Bacteria "Bacteria" NA NA NA NA
## p__Bacteroidetes "Bacteria" "Bacteroidetes" NA NA NA
## p__Firmicutes "Bacteria" "Firmicutes" NA NA NA
## p__Proteobacteria "Bacteria" "Proteobacteria" NA NA NA
## c__Bacteroidia "Bacteria" "Bacteroidetes" "Bacteroidia" NA NA
## c__Clostridia "Bacteria" "Firmicutes" "Clostridia" NA NA
## Genus Species Strain
## k__Bacteria NA NA NA
## p__Bacteroidetes NA NA NA
## p__Firmicutes NA NA NA
## p__Proteobacteria NA NA NA
## c__Bacteroidia NA NA NA
## c__Clostridia NA NA NA
The process of subsetting begins with the names of phylogenetic ranks:
rank_names( loman.pseq )
## [1] "Kingdom" "Phylum" "Class" "Order" "Family" "Genus" "Species"
## [8] "Strain"
Taxa can be filtered by these rank names. For example, to return an object with only species and strains:
subset_taxa( loman.pseq, !is.na(Species))
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 535 taxa and 44 samples ]
## sample_data() Sample Data: [ 44 samples by 19 sample variables ]
## tax_table() Taxonomy Table: [ 535 taxa by 8 taxonomic ranks ]
To keep only phylum-level data (not class or finer, and not kingdom-level):
subset_taxa( loman.pseq, is.na(Class) & !is.na(Phylum))
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 8 taxa and 44 samples ]
## sample_data() Sample Data: [ 44 samples by 19 sample variables ]
## tax_table() Taxonomy Table: [ 8 taxa by 8 taxonomic ranks ]
Or to keep only Bacteroidetes phylum. Note that taxa names have been shortened from the rownames of the ExpressionSet
object, for nicer plotting.
loman.bd = subset_taxa( loman.pseq, Phylum == "Bacteroidetes")
head( taxa_names( loman.bd ) )
## [1] "p__Bacteroidetes" "c__Bacteroidia" "o__Bacteroidales"
## [4] "f__Bacteroidaceae" "f__Porphyromonadaceae" "f__Rikenellaceae"
The phyloseq package provides advanced pruning of taxa, such as the following which keeps only taxa that are among the most abundant 5% in at least five samples:
keepotu = genefilter_sample(loman.pseq, filterfun_sample(topp(0.05)), A=5)
summary(keepotu)
## Mode FALSE TRUE NA's
## logical 583 153 0
subset_taxa(loman.pseq, keepotu)
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 153 taxa and 44 samples ]
## sample_data() Sample Data: [ 44 samples by 19 sample variables ]
## tax_table() Taxonomy Table: [ 153 taxa by 8 taxonomic ranks ]
Note that phyloseq also provides topk()
for selecting the most abundant k
taxa, and other functions for advanced pruning of taxa.
The phyloseq package provides the plot_heatmap()
function to create heatmaps using a variety of built-in dissimilarity metrics for clustering. Here, we apply the same abundance filter as above, keep only strain-level OTUs. This function supports a large number of distance and ordination methods, here we use Bray-Curtis dissimilarity for distance and PCoA as the ordination method for organizing the heatmap.
loman.filt = subset_taxa(loman.pseq, keepotu & !is.na(Strain))
plot_heatmap(loman.filt, method="PCoA", distance="bray")
Here we plot the top 20 most abundant species (not strains), defined by the sum of abundance across all samples in the dataset:
loman.sp = subset_taxa(loman.pseq, !is.na(Species) & is.na(Strain))
par(mar = c(20, 4, 0, 0) + 0.15) #increase margin size on the bottom
barplot(sort(taxa_sums(loman.sp), TRUE)[1:20] / nsamples(loman.sp),
ylab = "Total counts", las = 2)
The phyloseq package calculates numerous alpha diversity measures. Here we compare three diversity in the species-level data, stratifying by stool texture:
alphas = c("Shannon", "Simpson", "InvSimpson")
plot_richness(loman.sp, "stooltexture", measures = alphas)
Let’s compare these three alpha diversity measures:
pairs( estimate_richness(loman.sp, measures = alphas) )
Numerous beta diversity / dissimilarity are provided by the distance()
function when provided a phyloseq object, and these can be used for any kind of clustering or classification scheme. For example, here is a hierarchical clustering dendrogram produced by the hclust()
from the base R stats
package with “Ward” linkage:
mydist = distance(loman.sp, method="bray")
myhclust = hclust( mydist )
plot(myhclust, main="Bray-Curtis Dissimilarity",
method="ward.D", xlab="Samples", sub = "")
The phyloseq package provides a variety of ordination methods, with convenient options for labelling points. Here is a Principal Coordinates Analysis plot of species-level taxa from the Loman dataset, using Bray-Curtis distance:
ordinated_taxa = ordinate(loman.sp, method="PCoA", distance="bray")
plot_ordination(loman.sp, ordinated_taxa, color="stooltexture",
title = "Bray-Curtis Principal Coordinates Analysis")
plot_scree(ordinated_taxa, title="Screeplot")
Authors welcome the addition of new datasets provided they can be or already have been run through the MetaPhlAn2 and HUMAnN2 pipelines. Please read the developer documentation and contact the maintainer if you have a shotgun metagenomic dataset that would be of interest to the Bioconductor community.
Development of the curatedMetagenomicData
package occurs on GitHub and bugs reported via GitHub issues will recieve the quickest response. Please visit the project repository and file an issue should you find one.
Visit the Bioconductor support site at https://support.bioconductor.org/, breifly describe your issue, and add the #curatedmetagenomicdata tag.