CAGEr 1.32.1
This document briefly how to use CAGEr CAGEr, a Bioconductor package designed to process, analyse and visualise Cap Analysis of Gene Expression (CAGE)sequencing data. CAGE (Kodzius et al. 2006) is a high-throughput method for transcriptome analysis that utilizes cap trapping (Carninci et al. 1996), a technique based on the biotinylation of the 7-methylguanosine cap of Pol II transcripts, to pulldown the 5′-complete cDNAs reversely transcribed from the captured transcripts. A linker sequence is ligated to the 5′ end of the cDNA and a specific restriction enzyme is used to cleave off a short fragment from the 5′ end. Resulting fragments are then amplified and sequenced using massive parallel high-throughput sequencing technology, which results in a large number of short sequenced tags that can be mapped back to the referent genome to infer the exact position of the transcription start sites (TSSs) used for transcription of captured RNAs (Figure 1). The number of CAGE tags supporting each TSS gives the information on the relative frequency of its usage and can be used as a measure of expression from that specific TSS. Thus, CAGE provides information on two aspects of capped transcriptome: genome-wide 1bp-resolution map of TSSs and transcript expression levels. This information can be used for various analyses, from 5′ centered expression profiling (Takahashi et al. 2012) to studying promoter architecture (Carninci et al. 2006).
CAGE samples derived from various organisms (genomes) can be analysed by CAGEr and the only limitation is the availability of the referent genome as a BSgenome package in case when raw mapped CAGE tags are processed. CAGEr provides a comprehensive workflow that starts from mapped CAGE tags and includes reconstruction of TSSs and promoters and their visualisation, as well as more specialized downstream analyses like promoter width, expression profiling and differential TSS usage. It can use both Binary Sequence Alignment Map (BAM) files of aligned CAGE tags or files with genomic locations of TSSs and number of supporting CAGE tags as input. If BAM files are provided CAGEr constructs TSSs from aligned CAGE tags and counts the number of tags supporting each TSS, while allowing filtering out low-quality tags and removing technology-specific bias. It further performs normalization of raw CAGE tag count, clustering of TSSs into tag clusters (TC) and their aggregation across multiple CAGE experiments into promoters to construct the promoterome. Various methods for normalization and clustering of TSSs are supported. Exporting data into different types of track files allows various visualisations of TSSs and clusters (promoters) in the UCSC Genome Browser, which facilitate generation of hypotheses. CAGEr manipulates multiple CAGE experiments at once and performs analyses across datasets, including expression profiling and detection of differential TSS usage (promoter shifting). Multicore option for parallel processing is supported on Unix-like platforms, which significantly reduces computing time.
Here are some of the functionalities provided in this package:
Reading in multiple CAGE datasets from various sources; user provided BAM or TSS input files, public CAGE datasets from accompanying data package.
Correcting systematic G nucleotide addition bias at the 5′ end of CAGE tags.
Plotting pairwise scatter plots, calculating correlation between datasets and merging datasets.
Normalizing raw CAGE tag count: simple tag per million (tpm) or power-law based normalization (Balwierz et al. 2009).
Clustering individual TSSs into tag clusters (TCs) and aggregating clusters across multiple CAGE datasets to create a set of consensus promoters.
Making bedGraph or BED files of individual TSSs or clusters for visualisation in the genome browser.
Expression clustering of individual TSSs or consensus promoters into distinct expression profiles using common clustering algorithms.
Calculating promoter width based on the cumulative distribution of CAGE signal along the promoter.
Scoring and statistically testing differential TSS usage (promoter shifting) and detecting promoters that shift between two samples.
Several data packages are accompanying CAGEr package. They contain majority of the up-to-date publicly available CAGE data produced by major consortia including FANTOM and ENCODE. These include FANTOM3and4CAGE package available from Bioconductor, as well as ENCODEprojectCAGE and ZebrafishDevelopmentalCAGE packages available from http://promshift.genereg.net/CAGEr/. In addition, direct fetching of TSS data from FANTOM5 web resource (the largest collection of TSS data for human and mouse) from within CAGEr is also available. These are all valuable resources of genome-wide TSSs in various tissue/cell types for various model organisms that can be used directly in R. Section 5 in this vignette describes how these public datasets can be included into a workflow provided by CAGEr. For further information on the content of the data packages and the list of available CAGE datasets please refer to the vignette of the corresponding data package.
For further details on the implemented methods and for citing the CAGEr package in your work please refer to (Haberle et al. 2015).
CAGEr package supports three types of CAGE data input:
Sequenced CAGE tags mapped to the genome: either BAM (Binary Sequence Alignment Map) files of sequenced CAGE tags aligned to the referent genome (including the paired-end data such as CAGEscan) or BED files of CAGE tags (fragments).
CAGE detected TSSs (CTSSs): tab separated files with genomic coordinates of TSSs and number of tags supporting each TSS. The file should not contain a header and the data must be organized in four columns:
Publicly available CAGE datasets from R data package: Several data packages containing CAGE data for various organisms produced by major consortia are accompanying this package. Selected subset of these data can be used as input for .
The type and the format of the input files is specified at the beginning of the workflow, when the
CAGEset
object is created (section 4.2). This is done by setting the inputFilesType
argument,
which accepts the following self-explanatory options referring to formats mentioned above:
"bam", "bamPairedEnd", "bed", "ctss", "CTSStable"
.
In addition, the package provides a method for coercing a data.frame
object containing single
base-pair TSS information into a CAGEset
object (as described in section 4.2.1), which can be
further used in the workflow described below.
We start the workflow by creating a CAGEexp object, which is a container for storing CAGE datasets and all the results that will be generated by applying specific functions. The CAGEexp objects are an extension of the MultiAssayExperiment class, and therefore can use all their methods. The expression data is stored in CAGEexp using SummarizedExperiment objects, and can also access their methods.
To load the CAGEr package and the other libraries into your R envirnoment type:
library(CAGEr)
In this tutorial we will be using data from zebrafish Danio rerio that was
mapped to the danRer7
assembly of the genome. Therefore, the corresponding
genome package BSgenome.Drerio.UCSC.danRer7 has to be installed.
It will be automatically loaded by CAGEr commands when needed.
In case the data is mapped to a genome that is not readily available through
BSgenome package (not in the list returned by BSgenome::available.genomes()
function), a custom BSgenome package has to be build and installed first.
(See the vignette within the BSgenome package for instructions on how to build
a custom genome package). The genomeName
argument can then be set to the name
of the build genome package when creating a CAGEexp
object (see the section
Creating CAGEexp
object below).
The BSgenome package is used to access information about the chromosomes
(sequence, length, circularity). By default, CAGEr will discard alignments
that are not on chromosomes named in the BSgenome package. The genomeName
argument can be set to NULL
in order to suppress this behaviour, or as a last
resort when no BSgenome package is available. However, CAGEr functions that
need access to the genome sequence, for instance for G-correction will not
work in that case.
The subset of zebrafish (Danio rerio) developmental time-series CAGE data generated by (Nepal et al. 2013) will be used in the following demonstration of the CAGEr workflow.
Files with genomic coordinates of TSSs detected by CAGE in 4 zebrafish
developmental stages are included in this package in the extdata
subdirectory.
The files contain TSSs from a part of chromosome 17 (26,000,000-46,000,000), and
there are two files for one of the developmental stages (two independent
replicas). The data in files is organized in four tab-separated columns as
described above in section 2.
inputFiles = list.files( system.file("extdata", package = "CAGEr")
, "ctss$"
, full.names = TRUE)
The CAGEexp object is crated with the CAGEexp
contstructor, that requries
information on file path an type, sample names and reference genome name.
ce <- CAGEexp( genomeName = "BSgenome.Drerio.UCSC.danRer7"
, inputFiles = inputFiles
, inputFilesType = "ctss"
, sampleLabels = sub( ".chr17.ctss", "", basename(inputFiles))
)
To display the created object type:
ce
## A CAGEexp object of 0 listed
## experiments with no user-defined names and respective classes.
## Containing an ExperimentList class object of length 0:
## Functionality:
## experiments() - obtain the ExperimentList instance
## colData() - the primary/phenotype DataFrame
## sampleMap() - the sample coordination DataFrame
## `$`, `[`, `[[` - extract colData columns, subset, or experiment
## *Format() - convert into a long or wide DataFrame
## assays() - convert ExperimentList to a SimpleList of matrices
## exportClass() - save all data to files
The supplied information can be seen in the Input data information section, whereas all other slots are still empty, since no data has been read yet and no analysis conducted.
In case when the CAGE / TSS data is to be read from input files, an empty CAGEexp object with
information about the files is first created as described above in section 3.2.
To actually read in the data into the object we use getCTSS()
function, that will add
an experiment called tagCountMatrix
to the CAGEexp object.
getCTSS(ce)
ce
## A CAGEexp object of 1 listed
## experiment with a user-defined name and respective class.
## Containing an ExperimentList class object of length 1:
## [1] tagCountMatrix: RangedSummarizedExperiment with 23343 rows and 5 columns
## Functionality:
## experiments() - obtain the ExperimentList instance
## colData() - the primary/phenotype DataFrame
## sampleMap() - the sample coordination DataFrame
## `$`, `[`, `[[` - extract colData columns, subset, or experiment
## *Format() - convert into a long or wide DataFrame
## assays() - convert ExperimentList to a SimpleList of matrices
## exportClass() - save all data to files
This function reads the provided files in the order they were specified in the
inputFiles
argument. It creates a single set of all TSSs detected across all
input datasets (union of TSSs) and a table with counts of CAGE tags supporting
each TSS in every dataset. (Note that in case when a CAGEr object is
created by coercion from an existing expression table there is no need to call
getCTSS()
).
Genomic coordinates of all TSSs and numbers of supporting CAGE tags in every input
sample can be retrieved using the CTSStagCountSE()
function. CTSScoordinatesGR()
accesses
the CTSS coordinates and CTSStagCountDF()
accesses the CTSS expression values.1 Data can also
be accessed directly using the native methods of the MultiAssayExperiment
and
SummarizedExperiment
classes, for example ce[["tagCountMatrix"]]
,
rowRanges(ce[["tagCountMatrix"]])
and assay(ce[["tagCountMatrix"]])
.
CTSStagCountSE(ce)
## class: RangedSummarizedExperiment
## dim: 23343 5
## metadata(0):
## assays(1): counts
## rownames: NULL
## rowData names(0):
## colnames(5): Zf.30p.dome Zf.high Zf.prim6.rep1 Zf.prim6.rep2
## Zf.unfertilized.egg
## colData names(0):
CTSScoordinatesGR(ce)
## CTSS object with 23343 positions and 0 metadata columns:
## seqnames pos strand
## <Rle> <integer> <Rle>
## [1] chr17 26027430 +
## [2] chr17 26050540 +
## [3] chr17 26118088 +
## [4] chr17 26142853 +
## [5] chr17 26166954 +
## ... ... ... ...
## [23339] chr17 45975041 -
## [23340] chr17 45975540 -
## [23341] chr17 45975544 -
## [23342] chr17 45982697 -
## [23343] chr17 45999921 -
## -------
## seqinfo: 26 sequences (1 circular) from danRer7 genome
## BSgenome name: BSgenome.Drerio.UCSC.danRer7
CTSStagCountDF(ce)
## DataFrame with 23343 rows and 5 columns
## Zf.30p.dome Zf.high Zf.prim6.rep1 Zf.prim6.rep2 Zf.unfertilized.egg
## <Rle> <Rle> <Rle> <Rle> <Rle>
## 1 0 0 1 0 0
## 2 0 0 0 0 1
## 3 0 0 1 0 0
## 4 0 0 0 1 0
## 5 0 0 1 0 0
## ... ... ... ... ... ...
## 23339 1 0 0 0 0
## 23340 0 2 0 0 0
## 23341 0 1 0 0 0
## 23342 0 0 1 0 0
## 23343 1 0 0 0 0
For compatiblity with earlier CAGEr works using CAGEset objects, and to provide simpler data
formats, the coordinates and expression values can also be accessed as simple data.frames
.
Note howerver that with large data sets it can cause extreme performance issues.
head(CTSScoordinates(ce))
## chr pos strand
## 1 chr17 26027430 +
## 2 chr17 26050540 +
## 3 chr17 26118088 +
## 4 chr17 26142853 +
## 5 chr17 26166954 +
## 6 chr17 26222417 +
head(CTSStagCountDf(ce))
## Zf.30p.dome Zf.high Zf.prim6.rep1 Zf.prim6.rep2 Zf.unfertilized.egg
## 1 0 0 1 0 0
## 2 0 0 0 0 1
## 3 0 0 1 0 0
## 4 0 0 0 1 0
## 5 0 0 1 0 0
## 6 1 1 0 0 0
head(CTSStagCount(ce))
## chr pos strand Zf.30p.dome Zf.high Zf.prim6.rep1 Zf.prim6.rep2
## 1 chr17 26027430 + 0 0 1 0
## 2 chr17 26050540 + 0 0 0 0
## 3 chr17 26118088 + 0 0 1 0
## 4 chr17 26142853 + 0 0 0 1
## 5 chr17 26166954 + 0 0 1 0
## 6 chr17 26222417 + 1 1 0 0
## Zf.unfertilized.egg
## 1 0
## 2 1
## 3 0
## 4 0
## 5 0
## 6 0
Note that the samples are ordered in the way they were supplied when creating the CAGEexp object and will be presented in that order in all the results and plots. To check sample labels and their ordering type:
sampleLabels(ce)
## #FF0000 #CCFF00 #00FF66
## "Zf.30p.dome" "Zf.high" "Zf.prim6.rep1"
## #0066FF #CC00FF
## "Zf.prim6.rep2" "Zf.unfertilized.egg"
In addition, a colour is assigned to each sample, which is consistently used to depict that sample
in all the plots. By default a rainbow palette of colours is used and the hexadecimal format of
the assigned colours can be seen as names attribute of sample labels shown above. The colours can
be changed to taste at any point in the workflow using the setColors()
function.
By design, CAGE tags map transcription start sites and therefore detect promoters. Quantitatively, the proportion of tags that map to promoter regions will depend both on the quality of the libraries and the quality of the genome annotation, which may be incomplete. Nevertheless, strong variations between libraries prepared in the same experiment may be used for quality controls.
CAGEr can intersect CTSSes with reference transcript models and annotate them with
the name(s) of the models, and the region categories promoter, exon, intron and
unknown, by using the annotateCTSS
function. The reference models can be GENCODE
loaded with the import.gff
function of the rtracklayer package,
or any other input that has the same structure, see help("annotateCTSS")
for details.
In this example, we will use a sample annotation for zebrafish (see help("exampleZv9_annot")
).
annotateCTSS(ce, exampleZv9_annot)
The annotation results are stored as tag counts in the sample metadata, and as new columns in the CTSS genomic ranges
colData(ce)[,c("librarySizes", "promoter", "exon", "intron", "unknown")]
## DataFrame with 5 rows and 5 columns
## librarySizes promoter exon intron unknown
## <integer> <integer> <integer> <integer> <integer>
## Zf.30p.dome 41814 37843 2352 594 1025
## Zf.high 45910 41671 2848 419 972
## Zf.prim6.rep1 34053 29531 2714 937 871
## Zf.prim6.rep2 34947 30799 2320 834 994
## Zf.unfertilized.egg 56140 51114 2860 400 1766
CTSScoordinatesGR(ce)
## CTSS object with 23343 positions and 2 metadata columns:
## seqnames pos strand | genes annotation
## <Rle> <integer> <Rle> | <Rle> <Rle>
## [1] chr17 26027430 + | unknown
## [2] chr17 26050540 + | grid1a promoter
## [3] chr17 26118088 + | grid1a exon
## [4] chr17 26142853 + | grid1a intron
## [5] chr17 26166954 + | grid1a exon
## ... ... ... ... . ... ...
## [23339] chr17 45975041 - | unknown
## [23340] chr17 45975540 - | unknown
## [23341] chr17 45975544 - | unknown
## [23342] chr17 45982697 - | unknown
## [23343] chr17 45999921 - | unknown
## -------
## seqinfo: 26 sequences (1 circular) from danRer7 genome
## BSgenome name: BSgenome.Drerio.UCSC.danRer7
A function plotAnnot
is provided to plot the annotations as stacked bar plots.
Here, all the CAGE libraries look very promoter-specific.
plotAnnot(ce, "counts")
## Warning: Removed 20 rows containing missing values (geom_segment).
## Warning: Removed 20 rows containing missing values (geom_point).
As part of the basic sanity checks, we can explore the data by looking at the
correlation between the samples. The plotCorrelation2()
function will plot
pairwise scatter plots of expression scores per TSS or consensus cluster and
calculate correlation coefficients between all possible pairs of
samples2 Alternatively, the plotCorrelation()
function does the same and
colors the scatterplots according to point density, but is much slower.. A
threshold can be set, so that only regions with an expression score (raw or
normalized) above the threshold (either in one or both samples) are
considered when calculating correlation. Three different correlation measures
are supported: Pearson’s, Spearman’s and Kendall’s correlation coefficients.
Note that while the scatterplots are on a logarithmic scale with pseudocount
added to the zero values, the correlation coefficients are calculated on
untransformed (but thresholded) data.
corr.m <- plotCorrelation2( ce, samples = "all"
, tagCountThreshold = 1, applyThresholdBoth = FALSE
, method = "pearson")
Based on calculated correlation we might want to merge and/or rearrange some of the datasets. To
rearrange the samples in the temporal order of the zebrafish development (unfertilized egg -> high
-> 30 percent dome -> prim6) and to merge the two replicas for the prim6 developmental stage we use
the mergeSamples()
function:
mergeSamples(ce, mergeIndex = c(3,2,4,4,1),
mergedSampleLabels = c("zf_unfertilized_egg", "zf_high", "zf_30p_dome", "zf_prim6"))
annotateCTSS(ce, exampleZv9_annot)
The mergeIndex
argument controls which samples will be merged and how the final dataset will be
ordered. Samples labeled by the same number (in our case samples three and four) will be merged
together by summing number of CAGE tags per TSS. The final set of samples will be ordered in the
ascending order of values provided in mergeIndex
and will be labeled by the labels provided in
the mergedSampleLabels
argument. Note that mergeSamples
function resets all slots with results
of downstream analyses, so in case there were any results in the CAGEexp object prior to merging,
they will be removed. Thus, annotation has to be redone.
Library sizes (number of total sequenced tags) of individual experiments differ, thus
normalization is required to make them comparable. The librarySizes
function returns the total
number of CAGE tags in each sample:
librarySizes(ce)
## [1] 56140 45910 41814 69000
The CAGEr package supports both simple tags per million normalization and power-law based normalization. It has been shown that many CAGE datasets follow a power-law distribution (Balwierz et al. 2009). Plotting the number of CAGE tags (X-axis) against the number of TSSs that are supported by <= of that number of tags (Y-axis) results in a distribution that can be approximated by a power-law. On a log-log scale this reverse cumulative distribution will manifest as a monotonically decreasing linear function, which can be defined as
\[y = -1 * \alpha * x + \beta\]
and is fully determined by the slope \(\alpha\) and total number of tags T (which together with \(\alpha\) determines the value of \(\beta\)).
To check whether our CAGE datasets follow power-law distribution and in which range of values, we
can use the plotReverseCumulatives
function:
plotReverseCumulatives(ce, fitInRange = c(5, 1000), onePlot = TRUE)