ggseqalign hexlogo

1 Introduction

Showing small differences between two long strings, such as DNA or AA sequences is challenging, especially in R. Typically, DNA or AA sequence alignments show all characters in a sequence. The package ggmsa does this really well and is compatible with ggplot2. However, this is not viable for sequences over a certain length.
Alternatively, top level visualizations may, for example, represent degree of variation over the length in a line plot, making it possible to gauge how strongly sequences differ, but not the quality of the difference. The intention with this package is to provide a way to visualize sequence alignments over the whole length of arbitrarily long sequences without losing the ability to show small differences, see figure 1.

Example of ggseqalign visualization. Showcase of the package's capability to highlight differences between 2000 bp long DNA sequences.

Figure 1: Example of ggseqalign visualization
Showcase of the package’s capability to highlight differences between 2000 bp long DNA sequences.

2 Installation

Until the next major version of Bioconductor (expected October 2024), ggseqalign can be installed from the Devel version of Bioconductor.

if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install(version = "devel")
BiocManager::valid()              # checks for out of date packages
BiocManager::install("ggseqalign")

See the BiocManager vignette for instructions on using multiple versions of Bioconductor.

ggseqalign can also be installed from it’s original source on GitHub (requires devtools)

devtools::install_git("https://github.com/simeross/ggseqalign.git")

3 Basics

This package relies on two core functions, alignment_table() and plot_sequence_alignment(). At its core, the former uses PairwiseAlignment(), previously in Biostrings, now in pwalign, to align one or several query strings to a subject string to parse all information on mismatches, insertions and deletions into a table that is used as the input for plotting with plot_sequence_alignment().

A minimal example:

library(ggseqalign)
library(ggplot2)

query_strings <- (c("boo", "fibububuzz", "bozz", "baofuzz"))
subject_string <- "boofizz"

alignment <- alignment_table(query_strings, subject_string)

plot_sequence_alignment(alignment) +
    theme(text = element_text(size = 15))

Figure 2: Output of the minimal example code

This package is fully compatible with DNAStringSetand AAStringSet classes from Biostrings, an efficient and powerful way to handle sequence data. The two examples below use example data from the Biostrings package and requires it to be installed. To install Biostrings, enter

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("Biostrings")

This chunk demonstrates reading sequence data from a FASTA file into a DNAStringSet-class object and aligning it to a manually created DNAStringSet-class object.

library(ggseqalign)
library(Biostrings)
library(ggplot2)

query_sequences <- Biostrings::readDNAStringSet(system.file("extdata",
    "fastaEx.fa",
    package = "Biostrings"
))
subject_sequence <- DNAStringSet(paste0("AAACGATCGATCGTAGTCGACTGATGT",
                                        "AGTATATACGTCGTACGTAGCATCGTC",
                                        "AGTTACTGCATGCCGG"))

alignment <- alignment_table(query_sequences, subject_sequence)

plot_sequence_alignment(alignment) +
    theme(text = element_text(size = 15))

4 Hide mismatches

The plots that plot_sequence_alignment() generates can become hard to read if there are too many differences, see fig. 3. The package allows to hide character mismatches to preserve legibility of structural differences (fig. 4).

# load
dna <- Biostrings::readDNAStringSet(system.file("extdata",
    "dm3_upstream2000.fa.gz",
    package = "Biostrings"
))
q <- as(
    c(substr(dna[[1]], 100, 300)),
    "DNAStringSet"
)
s <- as(
    c(substr(dna[[2]], 100, 300)),
    "DNAStringSet"
)
names(q) <- c("noisy alignment")
names(s) <- "reference"

plot_sequence_alignment(alignment_table(q, s)) +
    theme(text = element_text(size = 15))

Figure 3: Example of a case where ggseqalign fails
If there are too many differences, the mismatches overlap each other and become noisy.

plot_sequence_alignment(alignment_table(q, s), hide_mismatches = TRUE) +
    theme(text = element_text(size = 15))

Hiding mismatches. Hiding character mismatches reduces visual noise if alignments have many character mismatches and preserves structural information.

Figure 4: Hiding mismatches
Hiding character mismatches reduces visual noise if alignments have many character mismatches and preserves structural information.

5 Styling with ggplot2

Since plot_sequence_alignment() produces a ggplot-class object, all aspects of the plots can be modified with ggplot2 functions, such as theme(). As an example, let’s recreate and modify figure 1.

library(ggseqalign)
library(ggplot2)
library(Biostrings)

dna <- readDNAStringSet(system.file("extdata", "dm3_upstream2000.fa.gz",
    package = "Biostrings"
))

q <- dna[2:4]
s <- dna[5]

q[1] <- as(
    replaceLetterAt(q[[1]], c(5, 200, 400), "AGC"),
    "DNAStringSet"
)
q[2] <- as(
    c(substr(q[[2]], 300, 1500), substr(q[[2]], 1800, 2000)),
    "DNAStringSet"
)
q[3] <- as(
    replaceAt(
        q[[3]], 1500,
        paste(rep("A", 1000), collapse = "")
    ),
    "DNAStringSet"
)
names(q) <- c("mismatches", "deletions", "insertion")
names(s) <- "reference"

pl <- plot_sequence_alignment(alignment_table(q, s))

pl <- pl +
    ylab("Sequence variants") +
    xlab("Length in bp") +
    scale_color_viridis_d() +
    theme(
        text = element_text(size = 20),
        axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1),
        axis.title = element_text()
    )

pl

Figure 5: Styling with ggplot2
In this example, text size was increased, axis labels were added, x-axis text rotated and the color scheme changed.

Some modifications may require digging into the plot object layers, this can get finicky but is possible. We can use pl$layers to get a summary of the object’s layers. In this case, the geom_point layers that plot the dots for mismatches are entry 8 in the layer list, the white bar that indicates deletions is usually in layer 2. You may want to change the deletion bar’s color if you use another plot background color. This code chunk modifies the pl object from the previous chunk; the above chunk has to be run prior to this one.

# Define background color
bg <- "grey90"

# Change plot background
pl <- pl + theme(panel.background = element_rect(
    fill = bg,
    colour = bg
))

# Match deletion to background
pl$layers[[2]]$aes_params$colour <- bg
# Increase mismatch indicator size and change shape
pl$layers[[8]]$aes_params$size <- 2
pl$layers[[8]]$aes_params$shape <- 4
pl$layers[[8]]$aes_params$colour <- "black"

pl

Modifying ggplot2 layers. In this example, deletion bars were adjusted to match background color and mismatch indicators were modified using plot layer modification

Figure 6: Modifying ggplot2 layers
In this example, deletion bars were adjusted to match background color and mismatch indicators were modified using plot layer modification

6 Alignment parameters

Any additional parameters to alignment_table() are passed on to pwalign::pairwiseAlignment(), check pwalign for a comprehensive overview over the available options. As a simple example, we may increase gap penalties for the alignment in 2.

library(ggseqalign)
library(ggplot2)

query_strings <- (c("boo", "fibububuzz", "bozz", "baofuzz"))
subject_string <- "boofizz"

alignment <- alignment_table(query_strings, subject_string, gapOpening = 20)

plot_sequence_alignment(alignment) +
    theme(text = element_text(size = 15))

Figure 7: Modified alignment parameters

7 Session info

The output in this vignette was produced under the following conditions:

sessionInfo()

## R version 4.4.0 RC (2024-04-16 r86468)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.4 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.20-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
## [1] ggplot2_3.5.1       ggseqalign_0.99.1   Biostrings_2.73.1  
## [4] GenomeInfoDb_1.41.1 XVector_0.45.0      IRanges_2.39.0     
## [7] S4Vectors_0.43.0    BiocGenerics_0.51.0 BiocStyle_2.33.0   
## 
## loaded via a namespace (and not attached):
##  [1] sass_0.4.9              utf8_1.2.4              generics_0.1.3         
##  [4] digest_0.6.35           magrittr_2.0.3          evaluate_0.23          
##  [7] grid_4.4.0              bookdown_0.39           fastmap_1.2.0          
## [10] jsonlite_1.8.8          tinytex_0.51            BiocManager_1.30.23    
## [13] httr_1.4.7              fansi_1.0.6             viridisLite_0.4.2      
## [16] UCSC.utils_1.1.0        scales_1.3.0            jquerylib_0.1.4        
## [19] cli_3.6.2               rlang_1.1.4             crayon_1.5.2           
## [22] munsell_0.5.1           withr_3.0.0             cachem_1.1.0           
## [25] yaml_2.3.8              tools_4.4.0             dplyr_1.1.4            
## [28] colorspace_2.1-0        GenomeInfoDbData_1.2.12 vctrs_0.6.5            
## [31] R6_2.5.1                magick_2.8.3            lifecycle_1.0.4        
## [34] zlibbioc_1.51.1         pwalign_1.1.0           pkgconfig_2.0.3        
## [37] pillar_1.9.0            bslib_0.7.0             gtable_0.3.5           
## [40] Rcpp_1.0.12             glue_1.7.0              highr_0.11             
## [43] xfun_0.44               tibble_3.2.1            tidyselect_1.2.1       
## [46] knitr_1.47              farver_2.1.2            htmltools_0.5.8.1      
## [49] rmarkdown_2.27          labeling_0.4.3          compiler_4.4.0

8 Credit

The research and data generation that was a major motivation for me to finally create this package has received funding from the Norwegian Financial Mechanism 2014-2021, project DivGene: UMO-2019/34/H/NZ9/00559

Quickstart Guide to ggseqalign

9 June 2024

Abstract

Package

Contents