Rqc - Quality Control Tool for High-Throughput Sequencing Data

April 9, 2014

Welliton Souza and Benilton Carvalho

Rqc is an optimized tool designed for quality control and assessment of high-throughput sequencing data. It performs parallel processing of entire files and produces a report which contains a set of high-resolution graphics that can be used for quality assessment.

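This version of Rqc produces high-quality images for the following statistics: average quality, cycle-specific average quality, read length distribution, cycle-specific GC content, cycle-specific quality distribution, and cycle-specific base call proportion.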

Basic Workflow

The main goal of Rqc is to provide graphical tools for quality assessment of reads contained in FASTQ files. The package is designed for simplicity of use: a single function, rqc, processes a set of input files and generates an HTML report containing several plots that can be used for quality assessment.

To access this functionality, the user needs to load the Rqc package.

library(Rqc)

The next step is to determine the location of the FASTQ files to be analyzed. The example below uses sample files provided by the ShortRead package; users must change this location to point to the files that need QA.

folder <- system.file(package="ShortRead", "extdata/E-MTAB-1147")

The basic usage of the rqc function requires two arguments. The first, path, is the directory where the files of interest are stored (defined in the step above). The second, pattern, is a regular expression that identifies the files of interest. Below, we use .fastq.gz so that all files whose names contain that string are processed.

rqc(path = folder, pattern = ".fastq.gz")
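
Because pattern is a regular expression, the unescaped dots in .fastq.gz match any single character, not only a literal dot. The base R sketch below uses made-up file names (they are assumptions, not part of the ShortRead data) to compare that loose pattern with a stricter, anchored alternative.

```r
# Hypothetical file names, used only to illustrate regular-expression matching
candidates <- c("sample1.fastq.gz", "sample2.fastq.gz",
                "notes_fastq_gz.txt", "sample3.fastq.gz.md5")

# Loose pattern from the text: "." matches any single character
loose <- grepl(".fastq.gz", candidates)

# Stricter pattern: literal dots, anchored at the end of the file name
strict <- grepl("\\.fastq\\.gz$", candidates)

data.frame(file = candidates, loose = loose, strict = strict)
```

For the sample files used here both patterns select the same files, so the rqc call above is unaffected; the stricter form mainly helps in directories that also hold checksums or notes with similar names.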

When the rqc call finishes, the user's default Internet browser opens an HTML file. This file is the report generated by Rqc, which, by default, is stored in a temporary directory. A sample report is shown below:

Rqc 1.2.0 - Quality Control Report

Average Quality

This plot describes the average quality pattern by showing on the X-axis quality thresholds and on the Y-axis the percentage of reads that exceed that quality level.


plot of chunk average-quality-plot

Cycle-specific Average Quality

This describes the average quality scores for each cycle of sequencing.


plot of chunk cycle-average-quality-plot

Read Length Distribution

Bar plot showing the distribution of the lengths of the reads available in the FASTQ file.


plot of chunk read-width-plot

Cycle-specific GC Content

Line plot showing the average GC content for every cycle of sequencing.


plot of chunk cycle-gc-plot

Cycle-specific Quality Distribution

Bar plot showing the proportion of quality calls per cycle. Colors follow a red-to-blue gradient, where red identifies calls of lower quality. This visualization is preferred because it is cleaner than the boxplots described below.


plot of chunk cycle-quality-plots

Cycle-specific Quality Distribution - Boxplot

Boxplots describing empirical patterns of quality distribution on each cycle of sequencing.


plot of chunk cycle-quality-boxplots

Cycle-specific Base Call Proportion

This bar plot describes the proportion of each nucleotide called for every cycle of sequencing.


plot of chunk cycle-basecall-plots

The line plot shows a more detailed view.


plot of chunk cycle-basecall-lineplots

It is important to note that the rqc function samples 1 million records from the FASTQ files. This number can be changed through the n argument. To process each file as a whole rather than sampling records from it, the user must set the sample argument to FALSE.
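
The two options described above can be sketched as follows. This is a minimal illustration that reuses the ShortRead sample files; the value 1e5 is an arbitrary choice for the example, and the calls run only if Rqc and the sample data are installed.

```r
# Sample files shipped with ShortRead; system.file() returns "" if they are missing
folder <- system.file(package = "ShortRead", "extdata/E-MTAB-1147")

# Run only when Rqc and the example data are available
if (nzchar(folder) && requireNamespace("Rqc", quietly = TRUE)) {
  library(Rqc)
  # Sample 100,000 records (arbitrary value) instead of the default 1 million
  rqc(path = folder, pattern = ".fastq.gz", n = 1e5)
  # Disable sampling and process every record
  rqc(path = folder, pattern = ".fastq.gz", sample = FALSE)
}
```

Each call produces its own report, so in practice one would normally pick only one of the two.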

Advanced Workflow

The rqc function wraps a set of functions to generate a quick report that can be used for quality assessment. However, users can perform a step-by-step analysis by using the information described below.

1. Defining input files

To process a set of files that are not located in the same directory, the user needs to create a vector containing the absolute paths of the files. The list.files function can be useful.

fastqDir <- system.file(package="ShortRead", "extdata/E-MTAB-1147")
files <- list.files(fastqDir, "fastq.gz", full.names=TRUE)

The example input files are samples from a public data set. These samples are available through the ShortRead package. More information regarding these data can be found in the vignette of that package.
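
For files spread over several directories, one possible approach is to pass a vector of directories to list.files. The sketch below is base R only; the directories runA and runB and the empty placeholder files are hypothetical and exist just to make the example self-contained (they are not valid FASTQ files).

```r
# Two hypothetical directories created under tempdir() for illustration
dirA <- file.path(tempdir(), "runA")
dirB <- file.path(tempdir(), "runB")
dir.create(dirA, showWarnings = FALSE)
dir.create(dirB, showWarnings = FALSE)

# Empty placeholder files standing in for real FASTQ files
file.create(file.path(dirA, "a_1.fastq.gz"), file.path(dirB, "b_1.fastq.gz"))

# list.files accepts a vector of directories; full.names = TRUE keeps the directory part
allFiles <- list.files(c(dirA, dirB), pattern = "\\.fastq\\.gz$", full.names = TRUE)

# Convert to absolute, canonical paths
allFiles <- normalizePath(allFiles)
basename(allFiles)
```

A vector built this way has the same form as the files vector defined earlier.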

2. Processing files

To process the files without generating an HTML report, the user should use the rqcQA function instead of rqc. This function receives a vector containing the paths of the input files.

rqcResultSet <- rqcQA(files = files)

The rqcQA function returns the information required to create the plots in the standard HTML report. Both rqc and rqcQA return a named list of RqcResultSet objects, which can be used directly as input to other Rqc functions such as rqcReport and plot. RqcResultSet is a class that extends the .QA class of the ShortRead package. An RqcResultSet object holds information from two perspectives: read-specific data and cycle-specific data. Both can be accessed using the [[ operator.

3. Generating report

To create a final HTML report, the user can apply the rqcReport function to the rqcResultSet object.

reportFile <- rqcReport(rqcResultSet)

The report generated by rqcReport contains all the plots described at the beginning of this document. By default, it is written to a temporary directory. This behavior can be modified by setting the outdir argument.
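
As a minimal sketch of the outdir argument, the example below writes the report to a directory of the user's choosing. The directory name rqc_reports is hypothetical, and the Rqc calls run only if the package and the ShortRead sample data are installed.

```r
# Hypothetical output directory, created under tempdir() for illustration
reportDir <- file.path(tempdir(), "rqc_reports")
dir.create(reportDir, showWarnings = FALSE)

# Sample files shipped with ShortRead; "" if they are not installed
fastqDir <- system.file(package = "ShortRead", "extdata/E-MTAB-1147")

if (nzchar(fastqDir) && requireNamespace("Rqc", quietly = TRUE)) {
  library(Rqc)
  files <- list.files(fastqDir, "fastq.gz", full.names = TRUE)
  rqcResultSet <- rqcQA(files = files)
  # Write the report to reportDir instead of the default temporary directory
  reportFile <- rqcReport(rqcResultSet, outdir = reportDir)
}
```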

Parallel processing

The Rqc package performs parallel processing of the samples by interfacing with the BiocParallel package. By default, BiocParallel loads MulticoreParam with the maximum number of workers available. At this point, Rqc supports only MulticoreParam for parallel processing. It is possible to process the input files serially through the use of the SerialParam function.
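
One common way to select a backend in BiocParallel is to register it as the default with register. The sketch below registers SerialParam and checks which backend is active; that Rqc picks up the registered default in this version is an assumption, not something stated above.

```r
# Run only when BiocParallel is installed
if (requireNamespace("BiocParallel", quietly = TRUE)) {
  library(BiocParallel)
  # Make serial evaluation the registered default backend
  register(SerialParam())
  # Retrieve the currently registered default and inspect its class
  backend <- bpparam()
  class(backend)
}
```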


For each plot generated by Rqc, there is a function that shapes the data appropriately. The shaped information is then used to produce the final plot. The example below shows how the user can access these data to generate plots using other tools.

df <- rqcCycleAverageQualityCalc(rqcResultSet)
cycle <- as.numeric(levels(df$cycle))[df$cycle]
plot(cycle, df$quality, col = df$filename, xlab='Cycle', ylab='Quality Score')

plot of chunk calc
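
The expression as.numeric(levels(df$cycle))[df$cycle] in the example above converts a factor to the numbers shown in its labels. The base R sketch below, with a made-up factor, shows why calling as.numeric directly on the factor would give a different result.

```r
# A factor whose labels are cycle numbers (made-up values for illustration)
cycle_factor <- factor(c("1", "2", "10", "2"))

# as.numeric() on a factor returns the internal integer codes, not the labels;
# the levels are sorted as text: "1", "10", "2"
codes <- as.numeric(cycle_factor)

# Indexing the numeric levels by the factor recovers the labelled values
values <- as.numeric(levels(cycle_factor))[cycle_factor]

rbind(codes = codes, values = values)
```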

One can also work with a subset of the results by subsetting the list.

sublist <- rqcResultSet[1]

plot of chunk subset
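
Since the results are held in a named list, the usual list subsetting rules apply. The sketch below uses a plain list with hypothetical names in place of real results to contrast single brackets, which return a shorter list, with double brackets, which return one element.

```r
# A plain named list standing in for per-file results (hypothetical names and values)
results <- list(fileA = 1:3, fileB = 4:6, fileC = 7:9)

# Single brackets keep the list structure and the names
one_as_list <- results[1]

# Double brackets extract the element itself
one_element <- results[[1]]

# Several entries can be selected by name
several <- results[c("fileA", "fileC")]
names(several)
```

Single brackets are the form used for sublist above, because the result is still a list.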

Final Considerations

The Rqc package provides a simple interface to generate plots often used for quality assessment of high-throughput sequencing data. It uses the standard Bioconductor parallelization framework to add efficiency to data processing. The images produced by the package are high-quality figures that can be used directly in publications.

Session Information

## R version 3.2.0 (2015-04-16)
## Platform: x86_64-unknown-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.2 LTS
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## other attached packages:
##  [1] Rqc_1.2.0               ggplot2_1.0.1          
##  [3] ShortRead_1.26.0        GenomicAlignments_1.4.0
##  [5] Rsamtools_1.20.0        GenomicRanges_1.20.0   
##  [7] GenomeInfoDb_1.4.0      Biostrings_2.36.0      
##  [9] XVector_0.8.0           IRanges_2.2.0          
## [11] S4Vectors_0.6.0         BiocGenerics_0.14.0    
## [13] BiocParallel_1.2.0      BiocStyle_1.6.0        
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.11.5          knitr_1.9            MASS_7.3-40         
##  [4] zlibbioc_1.14.0      munsell_0.4.2        colorspace_1.2-6    
##  [7] lattice_0.20-31      plyr_1.8.1           stringr_0.6.2       
## [10] hwriter_1.3.2        tools_3.2.0          grid_3.2.0          
## [13] gtable_0.1.2         Biobase_2.28.0       snow_0.3-13         
## [16] latticeExtra_0.6-26  lambda.r_1.1.7       futile.logger_1.4   
## [19] digest_0.6.8         reshape2_1.4.1       RColorBrewer_1.1-2  
## [22] formatR_1.1          futile.options_1.0.0 bitops_1.0-6        
## [25] evaluate_0.6         labeling_0.3         scales_0.2.4        
## [28] markdown_0.7.4       proto_0.3-10