1 Introduction

Ribo-Seq and polyRibo-Seq are a specific form of RNA-Seq gene expression experiments utilizing mRNA subpopulations directly bound to ribosomes. Compared to standard RNA-Seq, their readout of gene expression provides a better approximation of downstream protein abundance profiles due to their close association with translational processes. The most important difference among the two is that polyRibo-Seq utilizes polyribosomal RNA for sequencing, whereas Ribo-Seq is a footprinting approach restricted to sequencing RNA fragments protected by ribosomes (Ingolia et al. 2009; Aspden et al. 2014; Juntawong et al. 2015).

The workflow presented in this vignette contains most of the data analysis steps described by (Juntawong et al. 2014) including functionalities useful for processing both polyRibo-Seq and Ribo-Seq experiments. To improve re-usability and adapt to recent changes of software versions (e.g. R, Bioconductor and short read aligners), the code has been optimized accordingly. Thus, the results obtained with the updated workflow are expected to be similar but not necessarily identical with the published results described in the original paper.

Relevant analysis steps of this workflow include read preprocessing, read alignments against a reference genome, counting of reads overlapping with a wide range of genomic features (e.g. CDSs, UTRs, uORFs, rRNAs, etc.), differential gene expression and differential ribosome binding analyses, as well as a variety of genome-wide summary plots for visualizing RNA expression trends. Functions are provided for evaluating the quality of Ribo-seq data, for identifying novel expressed regions in the genomes, and for gaining insights into gene regulation at the post-transcriptional and translational levels. For example, the functions genFeatures and featuretypeCounts can be used to quantify the expression output for all feature types included in a genome annotation (e.g. genes, introns, exons, miRNAs, intergenic regions, etc.). To determine the approximate read length of ribosome footprints in Ribo-Seq experiments, these feature type counts can be obtained and plotted for specific read lengths separately. Typically, the most abundant read length obtained for translated features corresponds to the approximate footprint length occupied by the ribosomes of a given organism group. Based on the results from several Ribo-Seq studies, these ribosome footprints are typically ~30 nucleotides long (Ingolia, Lareau, and Weissman 2011; Ingolia et al. 2009; Juntawong et al. 2014). However, their length can vary by several nucleotides depending upon the optimization of the RNA digestion step and various factors associated with translational regulation. For quality control purposes of Ribo-Seq experiments it is also useful to monitor the abundance of reads mapping to rRNA genes due to the high rRNA content of ribosomes. This information can be generated with the featuretypeCounts function described above.

Coverage trends along transcripts summarized for any number of transcripts can be obtained and plotted with the functions featureCoverage and plotfeatureCoverage, respectively. Their results allow monitoring of the phasing of ribosome movements along triplets of coding sequences. Commonly, high quality data will display here for the first nucleotide of each codon the highest depth of coverage computed for the 5’ ends of the aligned reads.

Ribo-seq data can also be used to evaluate various aspects of translational control due to ribosome occupancy in upstream open reading frames (uORFs). The latter are frequently present in (or near) 5’ UTRs of transcripts. For this, the function predORFs can be used to identify ORFs in the nucleotide sequences of transcripts or their subcomponents such as UTR regions. After scaling the resulting ORF coordinates back to the corresponding genome locations using scaleRanges, one can use these novel features (e.g. uORFs) for expression analysis routines similar to those employed for pre-existing annotations, such as the exonic regions of genes. For instance, in Ribo-Seq experiments one can use this approach to systematically identify all transcripts occupied by ribosomes in their uORF regions. The binding of ribosomes to uORF regions may indicate a regulatory role in the translation of the downstream main ORFs and/or translation of the uORFs into functionally relevant peptides.

1.1 Experimental design

Typically, users want to specify here all information relevant for the analysis of their NGS study. This includes detailed descriptions of FASTQ files, experimental design, reference genome, gene annotations, etc.

2 Workflow environment

2.1 Load packages and sample data

The systemPipeR package needs to be loaded to perform the analysis steps shown in this report (H Backman and Girke 2016). The package allows users to run the entire analysis workflow interactively or with a single command while also generating the corresponding analysis report. For details see systemPipeR's main vignette.

library(systemPipeR)

2.2 Generate workflow environment

systemPipeRdata package is a helper package to generate a fully populated systemPipeR workflow environment in the current working directory with a single command. All the instruction for generating the workflow template are provide in the systemPipeRdata vignette here.

After building and loading the workflow environment generated by genWorkenvir from systemPipeRdata all data inputs are stored in a data/ directory and all analysis results will be written to a separate results/ directory, while the systemPipeRIBOseq.Rmd script and the targets file are expected to be located in the parent directory. The R session is expected to run from this parent directory. Additional parameter files are stored under param/.

To work with real data, users want to organize their own data similarly and substitute all test data for their own data. To rerun an established workflow on new data, the initial targets file along with the corresponding FASTQ files are usually the only inputs the user needs to provide.

For more details, please consult the documentation here. More information about the targets files from systemPipeR can be found here.

2.3 Run workflow

Now open the R markdown script systemPipeRIBOseq.Rmdin your R IDE (e.g. vim-r or RStudio) and run the workflow as outlined below.

Here pair-end workflow example is provided. Please refer to the main vignette systemPipeR.Rmd for running the workflow with single-end data.

If you are running on a single machine, use following code as an example to check if some tools used in this workflow are in your environment PATH. No warning message should be shown if all tools are installed.

2.4 Experiment definition provided by targets file

The targets file defines all FASTQ files and sample comparisons of the analysis workflow.

targetspath <- system.file("extdata", "targetsPE.txt", package = "systemPipeR")
targets <- read.delim(targetspath, comment.char = "#")[, 1:4]
targets
##                      FileName1                   FileName2
## 1  ./data/SRR446027_1.fastq.gz ./data/SRR446027_2.fastq.gz
## 2  ./data/SRR446028_1.fastq.gz ./data/SRR446028_2.fastq.gz
## 3  ./data/SRR446029_1.fastq.gz ./data/SRR446029_2.fastq.gz
## 4  ./data/SRR446030_1.fastq.gz ./data/SRR446030_2.fastq.gz
## 5  ./data/SRR446031_1.fastq.gz ./data/SRR446031_2.fastq.gz
## 6  ./data/SRR446032_1.fastq.gz ./data/SRR446032_2.fastq.gz
## 7  ./data/SRR446033_1.fastq.gz ./data/SRR446033_2.fastq.gz
## 8  ./data/SRR446034_1.fastq.gz ./data/SRR446034_2.fastq.gz
## 9  ./data/SRR446035_1.fastq.gz ./data/SRR446035_2.fastq.gz
## 10 ./data/SRR446036_1.fastq.gz ./data/SRR446036_2.fastq.gz
## 11 ./data/SRR446037_1.fastq.gz ./data/SRR446037_2.fastq.gz
## 12 ./data/SRR446038_1.fastq.gz ./data/SRR446038_2.fastq.gz
## 13 ./data/SRR446039_1.fastq.gz ./data/SRR446039_2.fastq.gz
## 14 ./data/SRR446040_1.fastq.gz ./data/SRR446040_2.fastq.gz
## 15 ./data/SRR446041_1.fastq.gz ./data/SRR446041_2.fastq.gz
## 16 ./data/SRR446042_1.fastq.gz ./data/SRR446042_2.fastq.gz
## 17 ./data/SRR446043_1.fastq.gz ./data/SRR446043_2.fastq.gz
## 18 ./data/SRR446044_1.fastq.gz ./data/SRR446044_2.fastq.gz
##    SampleName Factor
## 1         M1A     M1
## 2         M1B     M1
## 3         A1A     A1
## 4         A1B     A1
## 5         V1A     V1
## 6         V1B     V1
## 7         M6A     M6
## 8         M6B     M6
## 9         A6A     A6
## 10        A6B     A6
## 11        V6A     V6
## 12        V6B     V6
## 13       M12A    M12
## 14       M12B    M12
## 15       A12A    A12
## 16       A12B    A12
## 17       V12A    V12
## 18       V12B    V12

3 Read preprocessing

3.1 Quality filtering and adaptor trimming

The following custom function trims adaptors hierarchically from the longest to the shortest match of the right end of the reads. If internalmatch=TRUE then internal matches will trigger the same behavior. The argument minpatternlength defines the shortest adaptor match to consider in this iterative process. In addition, the function removes reads containing Ns or homopolymer regions. More detailed information on read preprocessing is provided in systemPipeR's main vignette.

First, we construct SYSargs2 object from cwl and yml param and targets files.

dir_path <- system.file("extdata/cwl/preprocessReads/trim-pe", 
    package = "systemPipeR")
trim <- loadWorkflow(targets = targetspath, wf_file = "trim-pe.cwl", 
    input_file = "trim-pe.yml", dir_path = dir_path)
trim <- renderWF(trim, inputvars = c(FileName1 = "_FASTQ_PATH1_", 
    FileName2 = "_FASTQ_PATH2_", SampleName = "_SampleName_"))
trim
output(trim)[1:2]

Next, we execute the code for trimming all the raw data.

fctpath <- system.file("extdata", "custom_Fct.R", package = "systemPipeR")
source(fctpath)
iterTrim <- ".iterTrimbatch1(fq, pattern='ACACGTCT', internalmatch=FALSE, minpatternlength=6, Nnumber=1, polyhomo=50, minreadlength=16, maxreadlength=101)"
preprocessReads(args = trim, Fct = iterTrim, batchsize = 1e+05, 
    overwrite = TRUE, compress = TRUE)
writeTargetsout(x = trim, file = "targets_trimPE.txt", step = 1, 
    new_col = c("FileName1", "FileName2"), new_col_output_index = c(1, 
        2), overwrite = TRUE)

3.2 FASTQ quality report

The following seeFastq and seeFastqPlot functions generate and plot a series of useful quality statistics for a set of FASTQ files including per cycle quality box plots, base proportions, base-level quality trends, relative k-mer diversity, length and occurrence distribution of reads, number of reads above quality cutoffs and mean quality distribution. The results are written to a PDF file named fastqReport.png.

fqlist <- seeFastq(fastq = infile1(trim), batchsize = 10000, 
    klength = 8)
png("./results/fastqReport.png", height = 18, width = 4 * length(fqlist), 
    units = "in", res = 72)
seeFastqPlot(fqlist)
dev.off()