alabaster.base 1.0.0
Developers can easily extend this framework to support more R/Bioconductor classes by creating their own alabaster package. This involves defining a schema, creating a staging function and creating a loading function. Readers are assumed to know the basics of R package development.
To demonstrate, let’s use the dgTMatrix class from the Matrix package.
In this vignette, we’ll be creating an alabaster.foo package to save and load these objects to/from file.
This involves some initial steps: clone the ArtifactDB/BiocObjectSchemas repository, create a subdirectory inside raw/ with the name of your schema, and add a v1.json file to that subdirectory.
Our new schema will be called sparse_triplet_matrix:
{
"$schema": "http://json-schema.org/draft-07/schema",
"$id": "sparse_triplet_matrix/v1.json",
"title": "Sparse Triplet Matrix",
"type": "object",
"description": "Sparse matrix in triplet form, as a demonstration of how to use a schema for my sparse matrix type. Data is stored in a CSV file with three columns - `i`, an integer field containing the 0-based row indices of each non-zero element; `j`, an integer field containing the 0-based column indices; and `x`, a double-precision field containing the value of the non-zero element.",
"allOf": [
{ "$ref": "../array/v1.json" },
{ "$ref": "../_md5sum/v1.json" }
],
"properties": {
"sparse_triplet_matrix": {
"type": "object",
"properties": {
"num_elements": {
"type": "integer",
"description": "Number of non-zero elements in the sparse matrix.",
"minimum": 0
}
},
"required": [ "num_elements" ],
"allOf": [ { "$ref": "../_compression/v1.json" } ]
}
},
"required": [ "sparse_triplet_matrix" ],
"_attributes": {
"format": "text/csv",
"restore": {
"R": "alabaster.foo::loadSparseTripletMatrix"
}
}
}
This involves a few choices:
- We inherit from the array/v1.json schema. This allows the sparse_triplet_matrix to be used inside objects that expect arrays, e.g., SummarizedExperiments. It also provides a standard place to store the array dimensions and type.
- We store the number of non-zero elements in the sparse_triplet_matrix property, just in case anyone cares. We’ll also report the compression method used for the CSV file.
- We declare the text/csv type in the _attributes.format property.
- The _attributes.restore.R property specifies the namespaced function to be used to restore the object in an R session.
Once we’re happy, we can run the scripts/resolver.py script to resolve all the $ref references.
python3 scripts/resolver.py raw resolved
This creates a resolved/sparse_triplet_matrix/v1.json file that looks like:
{
"$id": "sparse_triplet_matrix/v1.json",
"$schema": "http://json-schema.org/draft-07/schema",
"_attributes": {
"format": "text/csv",
"restore": {
"R": "alabaster.foo::loadSparseTripletMatrix"
}
},
"description": "Sparse matrix in triplet form, as a demonstration of how to use a schema for my sparse matrix type. Data is stored in a CSV file with three columns - `i`, an integer field containing the 0-based row indices of each non-zero element; `j`, an integer field containing the 0-based column indices; and `x`, a double-precision field containing the value of the non-zero element.\n\nDerived from `array/v1.json`: some kind of multi-dimensional array, where we store metadata about the dimensions and type of data. The exact implementation of the array is left to concrete subclasses.",
"properties": {
"$schema": {
"description": "The schema to use.",
"type": "string"
},
"array": {
"additionalProperties": false,
"properties": {
"dimensions": {
"description": "Dimensions of an n-dimensional array. Dimensions should be ordered from the fastest-changing to the slowest.",
"items": {
"type": "integer"
},
"minItems": 1,
"type": "array"
},
"type": {
"description": "Type of data stored in this array.",
"enum": [
"boolean",
"number",
"integer",
"string",
"other"
],
"type": "string"
}
},
"required": [
"dimensions"
],
"type": "object"
},
"is_child": {
"default": false,
"description": "Is this a child document, only to be interpreted in the context of the parent document from which it is linked? This may have implications for search and metadata requirements.",
"type": "boolean"
},
"md5sum": {
"description": "MD5 checksum for the file.",
"type": "string"
},
"path": {
"description": "Path to the file in the project directory.",
"type": "string"
},
"sparse_triplet_matrix": {
"properties": {
"compression": {
"description": "Type of compression applied to the file.",
"enum": [
"none",
"gzip",
"bzip2"
],
"type": "string"
},
"num_elements": {
"description": "Number of non-zero elements in the sparse matrix.",
"minimum": 0,
"type": "integer"
}
},
"required": [
"num_elements"
],
"type": "object"
}
},
"required": [
"$schema",
"array",
"md5sum",
"path",
"sparse_triplet_matrix"
],
"title": "Sparse Triplet Matrix ",
"type": "object"
}
Let’s put this file in the inst/schemas/sparse_triplet_matrix subdirectory of alabaster.foo.
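To get a feel for what the resolved schema demands, the sketch below checks a metadata list against the top-level required fields by hand. This is only an illustration in base R: the helper missingFields() is hypothetical, the required vector is copied from the resolved schema above, and the mock metadata deliberately omits md5sum to show a failure. Real validation is left to .writeMetadata() and a proper JSON schema validator.

```r
# Top-level fields listed under "required" in the resolved schema.
required <- c("$schema", "array", "md5sum", "path", "sparse_triplet_matrix")

# Hypothetical helper: report which required fields are absent
# from a metadata list. This is NOT how alabaster.base validates.
missingFields <- function(meta, required) {
    setdiff(required, names(meta))
}

# A mock metadata list that deliberately omits 'md5sum'.
mock <- list(
    `$schema`="sparse_triplet_matrix/v1.json",
    path="test/foo.csv.gz",
    array=list(dimensions=I(c(10L, 10L)), type="number"),
    sparse_triplet_matrix=list(num_elements=5L, compression="gzip")
)

absent <- missingFields(mock, required)
print(absent)
```

A check like this only covers the presence of top-level fields; types, enums and nested requirements still need the schema itself.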
We need to create a method for the stageObject() generic that operates on the class of interest.
This will be called whenever an instance of that class is encountered by stageObject(), either directly or as a child of another object (e.g., a SummarizedExperiment’s assay).
library(Matrix)
library(alabaster.base)
library(S4Vectors)
setMethod("stageObject", "dgTMatrix", function(x, dir, path, child=FALSE) {
    # Create a subdirectory to stash our contents.
    dir.create(file.path(dir, path), showWarnings=FALSE)
    # Create a DataFrame with the triplet data.
    df <- DataFrame(i = x@i, j = x@j, x = x@x)
    # .quickWriteCsv will make sure it's written in an 'alabaster-standard' format.
    outpath <- file.path(path, "foo.csv.gz")
    .quickWriteCsv(df, file.path(dir, outpath), compression="gzip")
    # Specifying the package name in the package attribute of the schema,
    # to ensure that .writeMetadata() can find it for validation.
    schema <- "sparse_triplet_matrix/v1.json"
    attr(schema, "package") <- "alabaster.foo"
    # Formatting the metadata for return.
    list(
        `$schema`=schema,
        # Reported path must be relative to 'dir'.
        path=outpath,
        # Pass along the 'child' specification from the call.
        is_child=child,
        `array`=list(
            # Need I() to prevent unboxing of length-1 vectors.
            dimensions=I(dim(x)),
            # double-precision values => 'number' in JSON schema's language.
            type="number"
        ),
        sparse_triplet_matrix=list(
            num_elements=nrow(df),
            compression="gzip"
        )
    )
})
Alright, let’s test this out with a mock dgTMatrix.
x <- sparseMatrix(
    i=c(1,2,3,5,6),
    j=c(3,6,1,3,8),
    x=runif(5),
    dims=c(10, 10),
    repr="T"
)
x
## 10 x 10 sparse Matrix of class "dgTMatrix"
##
## [1,] . . 0.4660350 . . . . . . .
## [2,] . . . . . 0.7813227 . . . .
## [3,] 0.273472 . . . . . . . . .
## [4,] . . . . . . . . . .
## [5,] . . 0.7579172 . . . . . . .
## [6,] . . . . . . . 0.521723 . .
## [7,] . . . . . . . . . .
## [8,] . . . . . . . . . .
## [9,] . . . . . . . . . .
## [10,] . . . . . . . . . .
tmp <- tempfile()
dir.create(tmp)
meta <- stageObject(x, tmp, "test")
str(meta)
## List of 5
## $ $schema : chr "sparse_triplet_matrix/v1.json"
## ..- attr(*, "package")= chr "alabaster.foo"
## $ path : chr "test/foo.csv.gz"
## $ is_child : logi FALSE
## $ array :List of 2
## ..$ dimensions: 'AsIs' int [1:2] 10 10
## ..$ type : chr "number"
## $ sparse_triplet_matrix:List of 2
## ..$ num_elements: int 5
## ..$ compression : chr "gzip"
Running .writeMetadata() will then validate meta against the schema requirements, using the schema file that we put inside alabaster.foo.
Finally, for loading, we need to define a new load* function for the new class that recreates the R object from the staged files.
In this case, it’s fairly simple:
loadSparseTripletMatrix <- function(info, project) {
    # Need to get the file path.
    path <- acquireFile(project, info$path)
    # This utility will check that the CSV is correctly formatted,
    # which is more stringent than read.csv.
    df <- .quickReadCsv(path,
        expected.columns=c(i="integer", j="integer", x="double"),
        expected.nrows=info$sparse_triplet_matrix$num_elements,
        compression=info$sparse_triplet_matrix$compression,
        row.names=FALSE
    )
    # Constructor uses 1-based indices.
    sparseMatrix(
        i=df$i + 1L,
        j=df$j + 1L,
        x=df$x,
        dims=info$array$dimensions,
        repr="T"
    )
}
Let’s try it out with our previously staged dgTMatrix:
loadSparseTripletMatrix(meta, tmp)
## 10 x 10 sparse Matrix of class "dgTMatrix"
##
## [1,] . . 0.4660350 . . . . . . .
## [2,] . . . . . 0.7813227 . . . .
## [3,] 0.273472 . . . . . . . . .
## [4,] . . . . . . . . . .
## [5,] . . 0.7579172 . . . . . . .
## [6,] . . . . . . . 0.521723 . .
## [7,] . . . . . . . . . .
## [8,] . . . . . . . . . .
## [9,] . . . . . . . . . .
## [10,] . . . . . . . . . .
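The index conventions are the easiest thing to get wrong here: the i and j slots of a dgTMatrix are 0-based, as is the CSV, while sparseMatrix() expects 1-based indices. The following sketch round-trips a small matrix through a gzipped CSV using only base R and Matrix. It uses write.csv() and read.csv() as stand-ins for .quickWriteCsv() and .quickReadCsv(), so it does not reproduce the alabaster-standard CSV format; the file name and values are made up for illustration.

```r
library(Matrix)

# Small triplet matrix with values that survive a trip through text exactly.
m <- sparseMatrix(i=c(1, 2, 5), j=c(3, 6, 3), x=c(0.5, 1.5, 2.25),
    dims=c(10, 10), repr="T")

# The slots hold 0-based indices, which is what goes into the file.
triplets <- data.frame(i=m@i, j=m@j, x=m@x)

# Write to a gzip-compressed CSV (base R stand-in for .quickWriteCsv).
csv.path <- file.path(tempdir(), "triplets.csv.gz")
con <- gzfile(csv.path, "w")
write.csv(triplets, con, row.names=FALSE)
close(con)

# Read it back through a gzfile connection (stand-in for .quickReadCsv).
df <- read.csv(gzfile(csv.path))

# Shift back to 1-based indices for the constructor.
rebuilt <- sparseMatrix(i=df$i + 1L, j=df$j + 1L, x=df$x,
    dims=dim(m), repr="T")

# Compare as dense matrices, as the triplet order need not match.
stopifnot(all(as.matrix(rebuilt) == as.matrix(m)))
```

Forgetting the + 1L on reload would shift every entry up and to the left, and typically fails outright because an index of zero is not valid for the constructor.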
To register this method with alabaster.base, alabaster.foo should add itself to the known schema locations upon load:
# Typically in zzz.R.
.onLoad <- function(libname, pkgname) {
    # getOption() returns the character vector itself (or NULL if unset),
    # whereas options() would return a named list.
    existing <- getOption("alabaster.schema.locations")
    options(alabaster.schema.locations=union("alabaster.foo", existing))
}
.onUnload <- function(libpath) {
    existing <- getOption("alabaster.schema.locations")
    options(alabaster.schema.locations=setdiff(existing, "alabaster.foo"))
}
This ensures that loadObject() can find the resolved schema file in alabaster.foo’s inst/schemas when it encounters metadata that uses the sparse_triplet_matrix/v1.json schema.
From the schema, loadObject() will look at the _attributes.restore.R property to figure out how to load the object into R, which will eventually lead to calling loadSparseTripletMatrix().
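The registration logic boils down to adding and removing a package name from a character vector stored in an option. The sketch below shows this in isolation with two hypothetical helpers, registerSchemas() and unregisterSchemas(), which are not part of alabaster.base. It uses getOption() so that an unset option is handled as NULL, and union() so that repeated loading does not add duplicates.

```r
# Hypothetical helpers, for illustration only.
registerSchemas <- function(pkg) {
    existing <- getOption("alabaster.schema.locations")
    options(alabaster.schema.locations=union(pkg, existing))
}

unregisterSchemas <- function(pkg) {
    existing <- getOption("alabaster.schema.locations")
    options(alabaster.schema.locations=setdiff(existing, pkg))
}

# Start from a clean slate, remembering the previous value.
old <- options(alabaster.schema.locations=NULL)

registerSchemas("alabaster.foo")
registerSchemas("alabaster.foo") # a second call adds no duplicate.
after.load <- getOption("alabaster.schema.locations")

unregisterSchemas("alabaster.foo")
after.unload <- getOption("alabaster.schema.locations")

# Restore whatever was set before.
options(old)
```

Putting the new package first in union() means its schemas are listed ahead of any existing locations; whether that ordering matters depends on how the locations are searched.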
Some staging methods may need to save child objects, represented by the resource pointers in the schema.
For example, a SummarizedExperiment would contain multiple children - each assay, the DataFrames for the column and row annotations, the metadata list, and so on.
These are easily handled by just calling the staging methods of each child object inside the staging method for the parent object:
# Abbreviated example from artificer.se:
setMethod("stageObject", "SummarizedExperiment", function(x, dir, path, child=FALSE) {
    dir.create(file.path(dir, path), showWarnings=FALSE)
    # Saving the colData.
    info <- .stageObject(colData(x), dir, file.path(path, "coldata"), child=TRUE)
    cd.info <- list(resource=.writeMetadata(info, dir=dir))
    # Saving the rowData.
    info <- .stageObject(rowData(x), dir, file.path(path, "rowdata"), child=TRUE)
    rd.info <- list(resource=.writeMetadata(info, dir=dir))
    # Saving the other metadata.
    info <- .stageObject(metadata(x), dir, file.path(path, "metadata"), child=TRUE)
    other.info <- list(resource=.writeMetadata(info, dir=dir))
    # Saving the assays, one subdirectory per assay.
    assay.info <- list()
    ass.names <- assayNames(x)
    for (i in seq_along(ass.names)) {
        curmat <- assay(x, i)
        mat.path <- file.path(path, paste0("assay-", i))
        meta <- .stageObject(curmat, path=mat.path, dir=dir, child=TRUE)
        deets <- .writeMetadata(meta, dir=dir)
        assay.info <- c(assay.info, list(list(name=ass.names[i], resource=deets)))
    }
    list(
        `$schema`="summarized_experiment/v1.json",
        # 'meta.name' is defined in the full method and is elided here.
        path=file.path(path, meta.name),
        summarized_experiment=list(
            assays=assay.info,
            column_data=cd.info,
            row_data=rd.info,
            other_data=other.info,
            dimensions=dim(x)
        ),
        is_child=child
    )
})
When doing so, developers should use .stageObject() instead of calling stageObject() (note the period).
.stageObject() is a wrapper around stageObject() that ensures that the new staging method respects application-specific overrides.
Similarly, some loading methods should use .loadObject() when loading child objects:
# Abbreviated example from artificer.se:
loadSummarizedExperiment <- function(exp.info, project) {
    all.assays <- list()
    for (y in seq_along(exp.info$summarized_experiment$assays)) {
        cur.ass <- exp.info$summarized_experiment$assays[[y]]
        aname <- cur.ass$name
        apath <- cur.ass$resource$path
        ass.info <- acquireMetadata(project, apath)
        all.assays[[aname]] <- .loadObject(ass.info, project=project)
    }
    cd.info <- acquireMetadata(project, exp.info$summarized_experiment$column_data$resource$path)
    cd <- .loadObject(cd.info, project=project)
    rd.info <- acquireMetadata(project, exp.info$summarized_experiment$row_data$resource$path)
    rd <- .loadObject(rd.info, project=project)
    other.info <- acquireMetadata(project, exp.info$summarized_experiment$other_data$resource$path)
    other <- .loadObject(other.info, project=project)
    SummarizedExperiment(all.assays, colData=cd, rowData=rd, metadata=other, checkDimnames=FALSE)
}
Inside the loading functions, we always use acquireMetadata() and acquireFile() to obtain the metadata and file paths, respectively.
This ensures that we respect any application-specific overrides, especially for project values that are not file paths.
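The reason for going through the dotted wrappers can be illustrated with a toy dispatcher. In the sketch below, everything is hypothetical: defaultStage(), toyStage(), setOverride() and the override itself are invented for illustration and do not mirror the internals of alabaster.base. The wrapper calls an application-supplied function if one is registered and falls back to the default otherwise, so children staged via the wrapper pick up the override, while direct calls to the default would bypass it.

```r
# Default behaviour: return minimal metadata for the child.
defaultStage <- function(x, path) {
    list(path=path, source="default")
}

# Storage for an optional application-specific override.
overrides <- new.env()

setOverride <- function(fun) {
    assign("stage", fun, envir=overrides)
}

# Wrapper: use the override if one is registered, otherwise the default.
toyStage <- function(x, path) {
    fun <- if (exists("stage", envir=overrides, inherits=FALSE)) {
        get("stage", envir=overrides)
    } else {
        defaultStage
    }
    fun(x, path)
}

# No override registered yet, so the default is used.
before <- toyStage(1:5, "child")

# An application modifies the metadata produced by the default.
setOverride(function(x, path) {
    meta <- defaultStage(x, path)
    meta$source <- "application"
    meta
})

# The same call now goes through the override.
after <- toyStage(1:5, "child")
```

A parent method that called defaultStage() directly would produce children without the application's changes, which is the situation that using .stageObject() and .loadObject() is meant to avoid.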
If the new schema refers to a standard Bioconductor object that is widely re-usable, you may consider submitting a PR with the (non-resolved) schema to ArtifactDB/BiocObjectSchemas.
If accepted, we will include your schema in the standard set of schemas known to alabaster.base. Once this is done, alabaster.foo can be simplified considerably:
- The .onLoad() and .onUnload() functions to register alabaster.foo are no longer required.
- The $schema string does not need a package attribute in the stageObject() method.
- The resolved schema no longer needs to be stored in inst/schemas.
You may also wish to create a PR to alabaster.base to add your class and package to the staging search path.
This will prompt stageObject() to attempt to load your package upon encountering an instance of the class of interest, ensuring that the correct staging method is automatically called even if the user hasn’t loaded your package.
sessionInfo()
## R version 4.3.0 RC (2023-04-13 r84269)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.2 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.17-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] S4Vectors_0.38.0 BiocGenerics_0.46.0 Matrix_1.5-4
## [4] alabaster.base_1.0.0 BiocStyle_2.28.0
##
## loaded via a namespace (and not attached):
## [1] cli_3.6.1 knitr_1.42 rlang_1.1.0
## [4] xfun_0.39 DelayedArray_0.26.0 jsonlite_1.8.4
## [7] V8_4.3.0 htmltools_0.5.5 MatrixGenerics_1.12.0
## [10] sass_0.4.5 rmarkdown_2.21 grid_4.3.0
## [13] evaluate_0.20 jquerylib_0.1.4 fastmap_1.1.1
## [16] IRanges_2.34.0 yaml_2.3.7 alabaster.schemas_1.0.0
## [19] Rhdf5lib_1.22.0 bookdown_0.33 jsonvalidate_1.3.2
## [22] BiocManager_1.30.20 compiler_4.3.0 Rcpp_1.0.10
## [25] rhdf5filters_1.12.0 lattice_0.21-8 rhdf5_2.44.0
## [28] digest_0.6.31 R6_2.5.1 curl_5.0.0
## [31] HDF5Array_1.28.0 bslib_0.4.2 tools_4.3.0
## [34] matrixStats_0.63.0 alabaster.matrix_1.0.0 cachem_1.0.7