Package: MsBackendSql
Authors: Johannes Rainer [aut, cre] (https://orcid.org/0000-0002-6977-7147),
Chong Tang [ctb],
Laurent Gatto [ctb] (https://orcid.org/0000-0002-1520-2268)
Compiled: Thu May 11 20:23:36 2023
The Spectra Bioconductor package provides a flexible and
expandable infrastructure for Mass Spectrometry (MS) data. The package supports
interchangeable use of different backends that provide additional file support
or different ways to store and represent MS data. The
MsBackendSql package provides backends to store data from whole
MS experiments in SQL databases. The data in such databases can be easily (and
efficiently) accessed using Spectra
objects that use the MsBackendSql
class
as an interface to the data in the database. Such Spectra
objects have a
minimal memory footprint and hence allow analysis of very large data sets even
on computers with limited hardware capabilities. For certain operations, the
performance of this data representation is superior to that of other low-memory
(on-disk) data representations such as Spectra
’s MsBackendMzR
backend.
Finally, the MsBackendSql
supports also remote data access to e.g. a central
database server hosting several large MS data sets.
The package can be installed with the BiocManager
package. To install
BiocManager
use install.packages("BiocManager")
and, after that,
BiocManager::install("MsBackendSql")
to install this package.
MsBackendSql
SQL databasesMsBackendSql
databases can be created either by importing (raw) MS data from
MS data files using the createMsBackendSqlDatabase
or using the
backendInitialize
function by providing in addition to the database connection
also the full MS data to import as a DataFrame
. In the first example we use
the createMsBackendSqlDatabase
function which takes a connection to an (empty)
database and the names of the files from which the data should be imported as
input parameters creates all necessary database tables and stores the full data
into the database. Below we create an empty SQLite database (in a temporary
file) and fill that with MS data from two mzML files (from the r Biocpkg("msdata")
package).
library(RSQLite)
dbfile <- tempfile()
con <- dbConnect(SQLite(), dbfile)
library(MsBackendSql)
fls <- dir(system.file("sciex", package = "msdata"), full.names = TRUE)
createMsBackendSqlDatabase(con, fls)
By default the m/z and intensity values are stored as BLOB data types in the database. This has advantages on the performance to extract peaks data from the database but would for example not allow to filter peaks by m/z values directly in the database. As an alternative it is also possible to the individual m/z and intensity values in separate rows of the database table. This long table format results however in considerably larger databases (with potentially poorer performance). Note also that the code and backend is optimized for MySQL/MariaDB databases by taking advantage of table partitioning and specialized table storage options. Any other SQL database server is however also supported (also portable, self-contained SQLite databases).
The MsBackendSql package provides two backends to interact with such
databases: the (default) MsBackendSql
class and the MsBackendOfflineSql
,
that inherits all properties and functions from the former, but which does not
store the connection to the database within the object but connects (and
disconnects) to (and from) the database in each function call. This allows to
use the latter also for parallel processing setups.
To access the data in the database we create below a Spectra
object providing
the connection to the database in the constructor call and specifying to use the
MsBackendSql
as backend using the source
parameter.
sps <- Spectra(con, source = MsBackendSql())
sps
## MSn data (Spectra) with 1862 spectra in a MsBackendSql backend:
## msLevel precursorMz polarity
## <integer> <numeric> <integer>
## 1 1 NA 1
## 2 1 NA 1
## 3 1 NA 1
## 4 1 NA 1
## 5 1 NA 1
## ... ... ... ...
## 1858 1 NA 1
## 1859 1 NA 1
## 1860 1 NA 1
## 1861 1 NA 1
## 1862 1 NA 1
## ... 34 more variables/columns.
## Use 'spectraVariables' to list all of them.
## Database: /var/folders/v1/y6dg5h4n163dzmrfl6t_r5480000gp/T//RtmpP0ffXU/file5ca11c57b001
Spectra
objects allow also to change the backend to any other backend
(extending MsBackend
) using the setBackend
function. Below we use this
function to first load all data into memory by changing from the MsBackendSql
to a MsBackendMemory
.
sps_mem <- setBackend(sps, MsBackendMemory())
sps_mem
## MSn data (Spectra) with 1862 spectra in a MsBackendMemory backend:
## msLevel rtime scanIndex
## <integer> <numeric> <integer>
## 1 1 0.280 1
## 2 1 0.559 2
## 3 1 0.838 3
## 4 1 1.117 4
## 5 1 1.396 5
## ... ... ... ...
## 1858 1 258.636 927
## 1859 1 258.915 928
## 1860 1 259.194 929
## 1861 1 259.473 930
## 1862 1 259.752 931
## ... 34 more variables/columns.
## Processing:
## Switch backend from MsBackendSql to MsBackendMemory [Thu May 11 20:23:55 2023]
With this function it is also possible to change from any backend to a
MsBackendSql
in which case a new database is created and all data from the
originating backend is stored in this database. We thus have to provide in
addition also a connection to an (empty) database using the dbcon
parameter. Below we create a new empty SQLite database and store all data
from the Spectra
object into this database using the setBackend
method.
tmpcon <- dbConnect(SQLite(), tempfile())
sps2 <- setBackend(sps_mem, MsBackendSql(), dbcon = tmpcon)
## Warning in .create_from_spectra_data(dbcon, data = data, ...): Replacing
## original column "spectrum_id_"
sps2
## MSn data (Spectra) with 1862 spectra in a MsBackendSql backend:
## msLevel precursorMz polarity
## <integer> <numeric> <integer>
## 1 1 NA 1
## 2 1 NA 1
## 3 1 NA 1
## 4 1 NA 1
## 5 1 NA 1
## ... ... ... ...
## 1858 1 NA 1
## 1859 1 NA 1
## 1860 1 NA 1
## 1861 1 NA 1
## 1862 1 NA 1
## ... 34 more variables/columns.
## Use 'spectraVariables' to list all of them.
## Database: /var/folders/v1/y6dg5h4n163dzmrfl6t_r5480000gp/T//RtmpP0ffXU/file5ca17a57be9a
## Processing:
## Switch backend from MsBackendSql to MsBackendMemory [Thu May 11 20:23:55 2023]
## Switch backend from MsBackendMemory to MsBackendSql [Thu May 11 20:23:58 2023]
Similar to any other Spectra
object we can retrieve the available spectra
variables using the spectraVariables
function.
spectraVariables(sps)
## [1] "msLevel" "rtime"
## [3] "acquisitionNum" "scanIndex"
## [5] "dataStorage" "dataOrigin"
## [7] "centroided" "smoothed"
## [9] "polarity" "precScanNum"
## [11] "precursorMz" "precursorIntensity"
## [13] "precursorCharge" "collisionEnergy"
## [15] "isolationWindowLowerMz" "isolationWindowTargetMz"
## [17] "isolationWindowUpperMz" "peaksCount"
## [19] "totIonCurrent" "basePeakMZ"
## [21] "basePeakIntensity" "ionisationEnergy"
## [23] "lowMZ" "highMZ"
## [25] "mergedScan" "mergedResultScanNum"
## [27] "mergedResultStartScanNum" "mergedResultEndScanNum"
## [29] "injectionTime" "filterString"
## [31] "spectrumId" "ionMobilityDriftTime"
## [33] "scanWindowLowerLimit" "scanWindowUpperLimit"
## [35] "spectrum_id_"
The MS peak data can be accessed using either the mz
, intensity
or
peaksData
functions. Below we extract the peaks matrix of the 5th spectrum and
display the first 6 rows.
peaksData(sps)[[5]] |>
head()
## mz intensity
## [1,] 105.0347 0
## [2,] 105.0362 164
## [3,] 105.0376 0
## [4,] 105.0391 0
## [5,] 105.0405 328
## [6,] 105.0420 0
All data (peaks data or spectra variables) are always retrieved on the fly
from the database resulting thus in a minimal memory footprint for the Spectra
object.
print(object.size(sps), units = "KB")
## 91.4 Kb
The backend supports also adding additional spectra variables or changing their values. Below we add 10 seconds to the retention time of each spectrum.
sps$rtime <- sps$rtime + 10
Such operations do however not change the data in the database (which is always considered read-only) but are cached locally within the backend object (in memory). The size in memory of the object is thus higher after changing that spectra variable.
print(object.size(sps), units = "KB")
## 106.1 Kb
Such $<-
operations can also be used to cache spectra variables
(temporarily) in memory which can eventually improve performance. Below we test
the time it takes to extract the MS level from each spectrum from the database,
then cache the MS levels in memory using $msLevel <-
and test the timing to
extract these cached variable.
system.time(msLevel(sps))
## user system elapsed
## 0.028 0.000 0.059
sps$msLevel <- msLevel(sps)
system.time(msLevel(sps))
## user system elapsed
## 0.007 0.000 0.033
We can also use the reset
function to reset the data to its original state
(this will cause any local spectra variables to be deleted and the backend to be
initialized with the original data in the database).
sps <- reset(sps)
To use the MsBackendOfflineSql
backend we need to provide all information
required to connect to the database along with the database driver to the
Spectra
function. Which parameters are required to connect to the database
depends on the SQL database and the used driver. In our example the data is
stored in a SQLite database, hence we use the SQLite()
database driver and
only need to provide the database name with the dbname
parameter. For a
MySQL/MariaDB database we would use the MariaDB()
driver and would have to
provide the database name, user name, password as well as the host name and port
through which the database is accessible.
sps_off <- Spectra(SQLite(), dbname = dbfile,
source = MsBackendOfflineSql())
sps_off
## MSn data (Spectra) with 1862 spectra in a MsBackendOfflineSql backend:
## msLevel precursorMz polarity
## <integer> <numeric> <integer>
## 1 1 NA 1
## 2 1 NA 1
## 3 1 NA 1
## 4 1 NA 1
## 5 1 NA 1
## ... ... ... ...
## 1858 1 NA 1
## 1859 1 NA 1
## 1860 1 NA 1
## 1861 1 NA 1
## 1862 1 NA 1
## ... 34 more variables/columns.
## Use 'spectraVariables' to list all of them.
## Database: /private/var/folders/v1/y6dg5h4n163dzmrfl6t_r5480000gp/T/RtmpP0ffXU/file5ca11c57b001
This backend provides the exact same functionality than MsBackendSql
with the
difference that the connection to the database is opened and closed for each
function call. While this leads to a slightly lower performance, it allows to
use the backend (and hence the Spectra
object) also in a parallel processing
setup. In contrast, for the MsBackendSql
parallel processing is disabled since
it is not possible to share the active backend connection within the object
across different parallel processes.
Below we compare the performance of the two backends. The performance difference is the result from opening and closing the database connection for each call. Note that this will also depend on the SQL server that is being used. For SQLite databases there is almost no overhead.
library(microbenchmark)
microbenchmark(msLevel(sps), msLevel(sps_off))
## Unit: milliseconds
## expr min lq mean median uq max neval
## msLevel(sps) 23.34557 28.00602 42.89684 36.68822 46.89810 189.6362 100
## msLevel(sps_off) 31.32622 36.77976 57.20382 48.35521 63.70515 201.8759 100
## cld
## a
## b
The need to retrieve any spectra data on-the-fly from the database will have an
impact on the performance of data access function of Spectra
objects using the
MsBackendSql
backends. To evaluate its impact we next compare the performance
of the MsBackendSql
to other Spectra
backends, specifically, the
MsBackendMzR
which is the default backend to read and represent raw MS data,
and the MsBackendMemory
backend that keeps all MS data in memory (and is thus
not suggested for larger MS experiments). Similar to the MsBackendMzR
, also
the MsBackendSql
keeps only a limited amount of data in memory. These
on-disk backends need thus to retrieve spectra and MS peaks data on-the-fly
from either the original raw data files (in the case of the MsBackendMzR
) or
from the SQL database (in the case of the MsBackendSql
). The in-memory backend
MsBackendMemory
is supposed to provide the fastest data access since all data
is kept in memory.
Below we thus create Spectra
objects from the same data but using the
different backends.
sps <- Spectra(con, source = MsBackendSql())
sps_mzr <- Spectra(fls, source = MsBackendMzR())
sps_im <- setBackend(sps_mzr, backend = MsBackendMemory())
At first we compare the memory footprint of the 3 backends.
print(object.size(sps), units = "KB")
## 91.4 Kb
print(object.size(sps_mzr), units = "KB")
## 386.2 Kb
print(object.size(sps_im), units = "KB")
## 54494.3 Kb
The MsBackendSql
has the lowest memory footprint of all 3 backends because it
does not keep any data in memory. The MsBackendMzR
keeps all spectra
variables, except the MS peaks data, in memory and has thus a larger size. The
MsBackendMemory
keeps all data (including the MS peaks data) in memory and has
thus the largest size in memory.
Next we compare the performance to extract the MS level for each spectrum from
the 4 different Spectra
objects.
library(microbenchmark)
microbenchmark(msLevel(sps),
msLevel(sps_mzr),
msLevel(sps_im))
## Unit: microseconds
## expr min lq mean median uq max
## msLevel(sps) 23602.343 27876.1640 30496.001 30271.138 33099.138 39370.202
## msLevel(sps_mzr) 1366.425 1868.8100 2390.003 2094.051 2429.880 5668.577
## msLevel(sps_im) 58.920 76.7135 103.884 98.717 112.477 914.237
## neval cld
## 100 a
## 100 b
## 100 c
Extracting MS levels is thus slowest for the MsBackendSql
, which is not
surprising because both other backends keep this data in memory while the
MsBackendSql
needs to retrieve it from the database.
We next compare the performance to access the full peaks data from each
Spectra
object.
microbenchmark(peaksData(sps, BPPARAM = SerialParam()),
peaksData(sps_mzr, BPPARAM = SerialParam()),
peaksData(sps_im, BPPARAM = SerialParam()), times = 10)
## Unit: milliseconds
## expr min lq mean
## peaksData(sps, BPPARAM = SerialParam()) 378.19393 473.789784 983.61916
## peaksData(sps_mzr, BPPARAM = SerialParam()) 1573.70589 1584.610217 2430.06834
## peaksData(sps_im, BPPARAM = SerialParam()) 7.40084 9.036435 22.24775
## median uq max neval cld
## 840.4625 1044.28992 2785.89691 10 a
## 1640.5166 3776.36462 4447.14354 10 b
## 10.6518 20.66029 69.38412 10 c
As expected, the MsBackendMemory
has the fasted access to the full peaks
data. The MsBackendSql
outperforms however the MsBackendMzR
providing faster
access to the m/z and intensity values.
Performance can be improved for the MsBackendMzR
using parallel
processing. Note that the MsBackendSql
does not support parallel
processing and thus parallel processing is (silently) disabled in functions such
as peaksData
.
m2 <- MulticoreParam(2)
microbenchmark(peaksData(sps, BPPARAM = m2),
peaksData(sps_mzr, BPPARAM = m2),
peaksData(sps_im, BPPARAM = m2), times = 10)
## Unit: milliseconds
## expr min lq mean
## peaksData(sps, BPPARAM = m2) 437.563502 524.627647 865.398530
## peaksData(sps_mzr, BPPARAM = m2) 1322.504865 1354.501535 2321.328036
## peaksData(sps_im, BPPARAM = m2) 2.102402 2.523905 3.411793
## median uq max neval cld
## 599.73651 1225.136722 1621.316334 10 a
## 1743.57342 3554.452666 3625.028901 10 b
## 2.76047 3.866625 6.875028 10 c
We next compare the performance of subsetting operations.
microbenchmark(filterRt(sps, rt = c(50, 100)),
filterRt(sps_mzr, rt = c(50, 100)),
filterRt(sps_im, rt = c(50, 100)))
## Unit: milliseconds
## expr min lq mean median
## filterRt(sps, rt = c(50, 100)) 9.501400 12.055389 14.433538 13.271598
## filterRt(sps_mzr, rt = c(50, 100)) 8.661695 11.214755 12.509864 12.315546
## filterRt(sps_im, rt = c(50, 100)) 2.106978 2.447407 3.256477 2.656384
## uq max neval cld
## 15.674606 63.333283 100 a
## 13.534529 21.491348 100 b
## 3.403411 8.367706 100 c
The two on-disk backends MsBackendSql
and MsBackendMzR
show a comparable
performance for this operation. This filtering does involves access to a spectra
variables (the retention time in this case) which, for the MsBackendSql
needs
first to be retrieved from the backend. The MsBackendSql
backend allows
however also to cache spectra variables (i.e. they are stored within the
MsBackendSql
object). Any access to such cached spectra variables can
eventually be faster because no dedicated SQL query is needed.
To evaluate the performance of a pure subsetting operation we first define the
indices of 10 random spectra and subset the Spectra
objects to these.
idx <- sample(seq_along(sps), 10)
microbenchmark(sps[idx],
sps_mzr[idx],
sps_im[idx])
## Unit: microseconds
## expr min lq mean median uq max neval
## sps[idx] 499.034 586.8375 928.3438 697.937 756.9765 12392.986 100
## sps_mzr[idx] 2702.745 3046.9940 4033.4506 3214.204 4363.5915 11409.517 100
## sps_im[idx] 917.645 1108.5895 1406.9452 1179.137 1266.4600 3732.415 100
## cld
## a
## b
## c
Here the MsBackendSql
outperforms the other backends because it does not keep
any data in memory and hence does not need to subset these. The two other
backends need to subset the data they keep in memory which is in both cases a
data frame with either a reduced set of spectra variables or the full MS data.
At last we compare also the extraction of the peaks data from the such subset
Spectra
objects.
sps_10 <- sps[idx]
sps_mzr_10 <- sps_mzr[idx]
sps_im_10 <- sps_im[idx]
microbenchmark(peaksData(sps_10),
peaksData(sps_mzr_10),
peaksData(sps_im_10),
times = 10)
## Unit: milliseconds
## expr min lq mean median uq
## peaksData(sps_10) 16.930804 17.514448 21.299980 21.071498 21.602958
## peaksData(sps_mzr_10) 149.482234 150.651700 161.667287 152.655541 157.336109
## peaksData(sps_im_10) 1.860251 2.068115 2.497851 2.397642 3.037444
## max neval cld
## 30.423126 10 a
## 231.780500 10 b
## 3.459612 10 c
The MsBackendSql
outperforms the MsBackendMzR
while, not unexpectedly, the
MsBackendMemory
provides fasted access.
MsBackendSql
The MsBackendSql
backend does not support parallel processing since the
database connection can not be shared across the different (parallel)
processes. Thus, all methods on Spectra
objects that use a MsBackendSql
will
automatically (and silently) disable parallel processing even if a dedicated
parallel processing setup was passed along with the BPPARAM
method.
Some functions on Spectra
objects require to load the MS peak data (i.e., m/z
and intensity values) into memory. For very large data sets (or computers with
limited hardware resources) such function calls can cause out-of-memory
errors. One example is the lengths
function that determines the number of
peaks per spectrum by loading the peak matrix first into memory. Such functions
should ideally be called using the peaksapply
function with parameter
chunkSize
(e.g., peaksapply(sps, lengths, chunkSize = 5000L)
). Instead of
processing the full data set, the data will be first split into chunks of size
chunkSize
that are stepwise processed. Hence, only data from chunkSize
spectra is loaded into memory in one iteration.
The MsBackendSql
provides an MS data representations and storage mode with a
minimal memory footprint (in R) that is still comparably efficient for standard
processing and subsetting operations. This backend is specifically useful for
very large MS data sets, that could even be hosted on remote (MySQL/MariaDB)
servers. A potential use case for this backend could thus be to set up a central
storage place for MS experiments with data analysts connecting remotely to this
server to perform initial data exploration and filtering. After subsetting to a
smaller data set of interest, users could then retrieve/download this data by
changing the backend to e.g. a MsBackendMemory
, which would result in a
download of the full data to the user computer’s memory.
sessionInfo()
## R version 4.3.0 RC (2023-04-13 r84257)
## Platform: x86_64-apple-darwin20 (64-bit)
## Running under: macOS Monterey 12.6.4
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
##
## locale:
## [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] microbenchmark_1.4.10 RSQLite_2.3.1 MsBackendSql_1.0.1
## [4] Spectra_1.10.1 ProtGenerics_1.32.0 BiocParallel_1.34.1
## [7] S4Vectors_0.38.1 BiocGenerics_0.46.0 BiocStyle_2.28.0
##
## loaded via a namespace (and not attached):
## [1] sandwich_3.0-2 sass_0.4.6 MsCoreUtils_1.12.0
## [4] lattice_0.21-8 hms_1.1.3 digest_0.6.31
## [7] grid_4.3.0 evaluate_0.21 bookdown_0.34
## [10] mvtnorm_1.1-3 fastmap_1.1.1 blob_1.2.4
## [13] Matrix_1.5-4 jsonlite_1.8.4 progress_1.2.2
## [16] mzR_2.34.0 DBI_1.1.3 survival_3.5-5
## [19] multcomp_1.4-23 BiocManager_1.30.20 TH.data_1.1-2
## [22] codetools_0.2-19 jquerylib_0.1.4 cli_3.6.1
## [25] rlang_1.1.1 crayon_1.5.2 Biobase_2.60.0
## [28] splines_4.3.0 bit64_4.0.5 cachem_1.0.8
## [31] yaml_2.3.7 tools_4.3.0 parallel_4.3.0
## [34] memoise_2.0.1 ncdf4_1.21 vctrs_0.6.2
## [37] R6_2.5.1 zoo_1.8-12 lifecycle_1.0.3
## [40] fs_1.6.2 IRanges_2.34.0 bit_4.0.5
## [43] clue_0.3-64 MASS_7.3-60 cluster_2.1.4
## [46] pkgconfig_2.0.3 bslib_0.4.2 data.table_1.14.8
## [49] Rcpp_1.0.10 xfun_0.39 knitr_1.42
## [52] htmltools_0.5.5 rmarkdown_2.21 compiler_4.3.0
## [55] prettyunits_1.1.1