Introduction

Comparing genomic DNA sequences of individuals of the same species reveals positions where single nucleotide variations (SNVs) occur. SNVs can influence transcription, RNA processing and translation, indicated by lower frequency of variants within sequence elements like transcription or translation start sites, splice sites and splicing regulatory elements. When localized within the coding sequence of a gene, SNVs can, among others, affect which amino acids are encoded by the altered codon, potentially leading to disease. Approximately 88% of human SNVs associated with disease are, however, not located within the coding sequence of genes, but within intronic and intergenic sequence segments. Nevertheless, annotations referring to the coding sequence of a specific transcript are still widely used, e.g. c.8754+3G>C (BRCA2 and Ensembl transcript-ID ENST00000544455), referring to the third intronic nucleotide downstream of the splice donor (SD) at the position of the 8754th coding nucleotide. Based on its position information referring to the coding sequence (c.) or alternatively to the genomic (g.) position (e.g. g.1256234A>G), our tool VarCon retrieves an adjustable SNV sequence neighborhood from the reference genome. Both intronic and exonic SNVs can lead to disease by activating cryptic splice sites, generating de novo splice sites or altering usage of physiological splices sites. However, disruption of splicing can also originate from SNVs within splice regulatory elements (SREs), potentially altering their ability to recruit splicing regulatory proteins (SRPs). The capacity of genomic sequences to recruit splicing regulatory proteins to the pre-mRNA transcript can by assessed by the HEXplorer score. Highly positive (negative) HZEI scores indicate sequence segments, which enhance (repress) usage of both downstream 5’ splice sites and upstream 3’ splice sites. . VarCon therefore calculates the HEXplorer score of the retrieved nucleotide sequence with and without the variation, to detect potential crucial changes in the property of the sequences to recruit SRPs. To visualize possible effects of SNVs on splice sites or splicing regulatory elements, which play an increasing role in cancer diagnostics and therapy, VarCon additionally calculates HBond scores of SDs and MaxEnt scores[10] of splice acceptor sites (SA).

Implementation

VarCon is an R package which can be executed from Windows, Linux or Mac OS. It executes a Perl script located in its directory and therefore relies on prior installation of some version of Perl (e.g. Strawberry Perl). Additionally, the human reference genome must be downloaded as fasta file (or zipped fasta.gz) with Ensembl chromosome names (“1” for chromosome 1) and subsequently uploaded into the R working environment, using the function “prepareReferenceFasta” to generate a large DNAStringset (file format of the R package Biostrings). In order to translate SNV positional information, referring to the coding sequence of a transcript, two transcript tables are pre-loaded with the VarCon package. Both contain exon and coding sequence coordinates of every transcript from Ensembl, and refer either to the genome assembly GRCh37 or GRCh38. Since the transcript table with the GRCh38 genomic coordinates (currently from Ensembl version 100) will be updated with further releases, a new transcript table can be downloaded using the Ensembl Biomart interface. Any newly generated transcript table, however, must contain the same columns and column names as described in the documentation of the current transcript tables for correct integration. Since, for instance, in cancer research the transcript which is used to refer to genomic positions of SNVs is often the same, a gene-to-transcript conversion table can be used for synonymous usage of certain gene names (or gene IDs) and transcript IDs (Ensembl ID). VarCon deliberately does not rely on Biomart queries using the Biomart R package, since these might be blocked by firewalls. Due to its structure, the VarCon package can accept any genome and transcript table combination which is available on Ensembl and thus additionally permits usage for any other organism represented in the Ensembl database[11]. The combination of already existing tools like Mutalyzer[12], SeqTailor[13] or ensembldb[14] can lead to similar results during the variation conversion and DNA sequence extraction. However, VarCon holds additional benefits, namely its straightforward usage even on a large-throughput scale, its independence due to the direct data entry and its instant graphical representation of splicing regulatory elements and intrinsic splice site strength.

After upload of the human reference genome, selection of the appropriate transcript table and a potential gene-to-transcript conversion table, a transcript ID (or gene name) and an SNV (whose positional information either refers to the coding (“c.”) or genomic (“g.”) sequence) are requested during the execution of the main function of the package. VarCon then uses the information of the transcripts’ exon coordinates to translate the SNV positional information to a genomic coordinate, if needed. Then the genomic sequence around the SNV position is retrieved from the reference genome in the direction of the open reading frame and committed to further analysis, both with and without the SNV.

For analysis of an SNV impact on splicing regulatory elements, VarCon calculates the HZEI score profile of reference and SNV sequences from the HEXplorer algorithm[7] and visualizes both in a bar plot. The HEXplorer score assesses splicing regulatory properties of genomic sequences, their capacity to recruit splicing regulatory proteins to the pre-mRNA transcript. Highly positive (negative) HZEI scores indicate sequence segments, which enhance (repress) usage of both downstream 5’ splice sites and upstream 3’ splice sites.

Additionally, intrinsic strengths of SD and SA sites are visualized within the HZEI score plot. SD strength is calculated by the HBond score, based on hydrogen bonds formed between a potential SD sequence and all 11 nucleotides of the free 5’ end of the U1 snRNA. SA strength is calculated by the MaxEnt score, which is essentially based on the observed distribution of splice acceptor sequences within the reference genome, while also taking into account dependencies between both non-neighboring and neighboring nucleotide positions[10].

VarCon can either be executed using integrated R package functions according to the manual on github, or with a GUI application based on R package shiny. The shiny app (app.R) can be found within the package directory “/VarCon/shiny/”. To provide the data needed by VarCon within the shiny app, the working directory has to be changed to the app.R source file location prior to starting the shiny application.

Applying VarCon to an SNV

The main function of the VarCon package is the getSeqInfoFromVariation function, which requires the following input parameters: a DNAStringSet of the reference genome (e.g. loaded with the integrated prepareReferenceFasta function), the Ensembl transcript ID, the SNV annotation (either refering to the coding sequence or genomic sequence), the size of the sequence surrounding which should be reported, the transcript table and optionally a gene-to-transcript conversion table.

First the needed variables are defined, namely the transcript table transcriptTable, which provides the information needed to translate SNV coordinates which refer to the coding sequence to genomic coordinates. transcriptID provides the respective transcript, the SNV is refering to. The variable variation holds the actual single nucleotide variation, whose sequence surrounding we will try to retrieve.

library(VarCon)
## Loading required package: Biostrings
## Loading required package: BiocGenerics
## 
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
## 
##     Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
##     as.data.frame, basename, cbind, colnames, dirname, do.call,
##     duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
##     lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
##     pmin.int, rank, rbind, rownames, sapply, setdiff, sort, table,
##     tapply, union, unique, unsplit, which.max, which.min
## Loading required package: S4Vectors
## Loading required package: stats4
## 
## Attaching package: 'S4Vectors'
## The following object is masked from 'package:utils':
## 
##     findMatches
## The following objects are masked from 'package:base':
## 
##     I, expand.grid, unname
## Loading required package: IRanges
## Loading required package: XVector
## Loading required package: GenomeInfoDb
## 
## Attaching package: 'Biostrings'
## The following object is masked from 'package:base':
## 
##     strsplit
## Loading required package: BSgenome
## Loading required package: GenomicRanges
## Loading required package: rtracklayer
## Defining exemplary input data
transcriptTable <- transCoord
transcriptID <- "pseudo_ENST00000650636"
variation <- "c.412T>G/p.(T89M)"

With the variables set, the function getSeqInfoFromVariation can now be used to retrieve information about the SNV, like position and surrounding reference sequence. As an input the function requires a DNAStringSet of the reference genome, a fitting transcript table referring to the same reference genome assembly, the actual single nucleotide variation of interest refering either to the genomic (g.) or coding (c.) position, the respective transcript ID, and the size of the sequence window around the SNV.

library(VarCon)
results <- getSeqInfoFromVariation(referenceDnaStringSet, transcriptID,
variation, ntWindow=20, transcriptTable)
results
## $transcript
## [1] "pseudo_ENST00000650636"
## 
## $funcAnnotation
## [1] "c.412T>G/p.(T89M)"
## 
## $ref_nuc
## [1] "T"
## 
## $alt_nuc
## [1] "G"
## 
## $genomicCoordinate
## [1] 5914
## 
## $sequence
## [1] "GGAAAGCTGAACATTGCTCATTGTGCTGCACAATTCGGGGG"
## 
## $altSeq
## [1] "GGAAAGCTGAACATTGCTCAGTGTGCTGCACAATTCGGGGG"
## 
## $genomic_range
## [1] "5894 5934"

The resulting list results holds 8 named elements, like the transcript ID, the variation, the genomic coordinate and surrounding sequences with and without the SNV. The elements $ref_nuc and $alt_nuc state the nucleotide at the SNV position on the strand the respective transcript is encoded.

In case the user would like to enter gene names instread of transcript IDs, during repeated entries of the same transcript ID for every gene, a gene2transcript conversion table can be provided the the getSeqInfoFromVariation function. Here we first define the gene name, which we want to use instead of the previously entered pseudo-transcript ID. Now we define the gene2transcript table by generating a data frame with the gene name, the gene ID and the transcript ID.

## Define gene 2 transcript table
geneName <- "Example_gene"
gene2transcript <- data.frame(gene_name = "Example_gene",
gene_ID = "pseudo_ENSG00000147099", transcriptID = "pseudo_ENST00000650636")

Only changing the entered transcript name to a gene name and defining a gene2transcript conversion table, enables to use the getSeqInfoFromVariation with gene names, in case the same transcript ID is used as a reference for a specific gene.

## Use function with gene name
results <- getSeqInfoFromVariation(referenceDnaStringSet, geneName,
variation, ntWindow=20, transcriptTable, gene2transcript=gene2transcript)
results
## $transcript
## [1] "pseudo_ENST00000650636"
## 
## $funcAnnotation
## [1] "c.412T>G/p.(T89M)"
## 
## $ref_nuc
## [1] "T"
## 
## $alt_nuc
## [1] "G"
## 
## $genomicCoordinate
## [1] 5914
## 
## $sequence
## [1] "GGAAAGCTGAACATTGCTCATTGTGCTGCACAATTCGGGGG"
## 
## $altSeq
## [1] "GGAAAGCTGAACATTGCTCAGTGTGCTGCACAATTCGGGGG"
## 
## $genomic_range
## [1] "5894 5934"

The resulting list holds the same information as in the example above.

The results object can now be visualized using the function generateHEXplorerPlot which will generate a HEXplorer plot stating the HEXplorer profile of the nucleotide surrounding and the strength of surrouning splice sites.

VarCon can alternatively be used as an shiny user interface using the startVarConApp().

Session info

sessionInfo()
## R version 4.3.0 RC (2023-04-13 r84269)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.2 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.17-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] VarCon_1.8.0         BSgenome_1.68.0      rtracklayer_1.60.0  
##  [4] GenomicRanges_1.52.0 Biostrings_2.68.0    GenomeInfoDb_1.36.0 
##  [7] XVector_0.40.0       IRanges_2.34.0       S4Vectors_0.38.0    
## [10] BiocGenerics_0.46.0 
## 
## loaded via a namespace (and not attached):
##  [1] SummarizedExperiment_1.30.0 gtable_0.3.3               
##  [3] rjson_0.2.21                xfun_0.39                  
##  [5] bslib_0.4.2                 ggplot2_3.4.2              
##  [7] Biobase_2.60.0              lattice_0.21-8             
##  [9] vctrs_0.6.2                 tools_4.3.0                
## [11] bitops_1.0-7                generics_0.1.3             
## [13] parallel_4.3.0              tibble_3.2.1               
## [15] fansi_1.0.4                 pkgconfig_2.0.3            
## [17] Matrix_1.5-4                lifecycle_1.0.3            
## [19] GenomeInfoDbData_1.2.10     compiler_4.3.0             
## [21] Rsamtools_2.16.0            munsell_0.5.0              
## [23] codetools_0.2-19            httpuv_1.6.9               
## [25] htmltools_0.5.5             sass_0.4.5                 
## [27] RCurl_1.98-1.12             yaml_2.3.7                 
## [29] later_1.3.0                 pillar_1.9.0               
## [31] crayon_1.5.2                jquerylib_0.1.4            
## [33] ellipsis_0.3.2              BiocParallel_1.34.0        
## [35] DelayedArray_0.26.0         cachem_1.0.7               
## [37] mime_0.12                   tidyselect_1.2.0           
## [39] digest_0.6.31               dplyr_1.1.2                
## [41] restfulr_0.0.15             fastmap_1.1.1              
## [43] grid_4.3.0                  colorspace_2.1-0           
## [45] cli_3.6.1                   magrittr_2.0.3             
## [47] XML_3.99-0.14               utf8_1.2.3                 
## [49] promises_1.2.0.1            scales_1.2.1               
## [51] rmarkdown_2.21              matrixStats_0.63.0         
## [53] shiny_1.7.4                 evaluate_0.20              
## [55] knitr_1.42                  shinycssloaders_1.0.0      
## [57] BiocIO_1.10.0               rlang_1.1.0                
## [59] Rcpp_1.0.10                 xtable_1.8-4               
## [61] glue_1.6.2                  jsonlite_1.8.4             
## [63] R6_2.5.1                    fs_1.6.2                   
## [65] shinyFiles_0.9.3            MatrixGenerics_1.12.0      
## [67] GenomicAlignments_1.36.0    zlibbioc_1.46.0