Contents

1 Overview

Developers can easily extend this framework to support more R/Bioconductor classes by creating their own alabaster package. This involves defining a schema, creating a staging function and creating a loading function. Readers are assumed to know the basics of R package development.

To demonstrate, let’s use the dgTMatrix classes from the Matrix package. In this vignette, we’ll be creating an alabaster.foo package to save and load these objects to/from file.

2 Defining a schema

This involves some initial steps:

  1. Clone the ArtifactDB/BiocObjectSchemas repository.
  2. Read this document for the schema conventions.
  3. Create a new subdirectory inside raw/ with the name of your schema.
  4. Create a schema file with the specified version, usually v1.json.

Our new schema will be called sparse_triplet_matrix:

{
    "$schema": "http://json-schema.org/draft-07/schema",
    "$id": "sparse_triplet_matrix/v1.json",
    "title": "Sparse Triplet Matrix",
    "type": "object",
    "description": "Sparse matrix in triplet form, as a demonstration of how to use a schema for my dense matrix type. Data is stored in a CSV file with three columns - `i`, an integer field containing the 0-based row indices of each non-zero element; `j`, an integer field containing the column indices; and `x`, a double-precision field containing the value of the non-zero element.",
    "allOf": [
        { "$ref": "../array/v1.json" },
        { "$ref": "../_md5sum/v1.json" }
    ],
    "properties": {
        "sparse_triplet_matrix": {
            "type": "object",
            "properties": {
                "num_elements": {
                    "type": "integer",
                    "description": "Number of non-zero elements in the sparse matrix.",
                    "minimum": 0
                }
            },
            "required": [ "num_elements" ],
            "allOf": [ { "$ref": "../_compression/v1.json" } ]
        }
    },
    "required": [ "sparse_triplet_matrix" ],
    "_attributes": {
        "format": "text/csv",
        "restore": {
            "R": "alabaster.foo::loadSparseTripletMatrix"
        }
    }
}

This involves a few choices:

Once we’re happy, we can run the scripts/resolver.py script to resolve all the $ref references.

python3 scripts/resolver.py raw resolved

This creates a resolved/sparse_triplet_matrix/v1.json file that looks like:


{
    "$id": "sparse_triplet_matrix/v1.json",
    "$schema": "http://json-schema.org/draft-07/schema",
    "_attributes": {
        "format": "text/csv",
        "restore": {
            "R": "alabaster.foo::loadSparseTripletMatrix"
        }
    },
    "description": "Sparse matrix in triplet form, as a demonstration of how to use a schema for my dense matrix type. Data is stored in a CSV file with three columns - `i`, an integer field containing the 0-based row indices of each non-zero element; `j`, an integer field containing the column indices; and `x`, a double-precision field containing the value of the non-zero element.\n\nDerived from `array/v1.json`: some kind of multi-dimensional array, where we store metadata about the dimensions and type of data. The exact implementation of the array is left to concrete subclasses.",
    "properties": {
        "$schema": {
            "description": "The schema to use.",
            "type": "string"
        },
        "array": {
            "additionalProperties": false,
            "properties": {
                "dimensions": {
                    "description": "Dimensions of an n-dimensional array. Dimensions should be ordered from the fastest-changing to the slowest.",
                    "items": {
                        "type": "integer"
                    },
                    "minItems": 1,
                    "type": "array"
                },
                "type": {
                    "description": "Type of data stored in this array.",
                    "enum": [
                        "boolean",
                        "number",
                        "integer",
                        "string",
                        "other"
                    ],
                    "type": "string"
                }
            },
            "required": [
                "dimensions"
            ],
            "type": "object"
        },
        "is_child": {
            "default": false,
            "description": "Is this a child document, only to be interpreted in the context of the parent document from which it is linked? This may have implications for search and metadata requirements.",
            "type": "boolean"
        },
        "md5sum": {
            "description": "MD5 checksum for the file.",
            "type": "string"
        },
        "path": {
            "description": "Path to the file in the project directory.",
            "type": "string"
        },
        "sparse_triplet_matrix": {
            "properties": {
                "compression": {
                    "description": "Type of compression applied to the file.",
                    "enum": [
                        "none",
                        "gzip",
                        "bzip2"
                    ],
                    "type": "string"
                },
                "num_elements": {
                    "description": "Number of non-zero elements in the sparse matrix.",
                    "minimum": 0,
                    "type": "integer"
                }
            },
            "required": [
                "num_elements"
            ],
            "type": "object"
        }
    },
    "required": [
        "$schema",
        "array",
        "md5sum",
        "path",
        "sparse_triplet_matrix"
    ],
    "title": "Sparse Triplet Matrix ",
    "type": "object"
}

Let’s put this file in the inst/schemas/sparse_triplet_matrix subdirectory of alabaster.foo.

3 Writing a staging method

We need to create a method for the stageObject() generic that operates on the class of interest. This will be called whenever an instance of that class is encountered by stageObject(), either directly or as a child of another object (e.g., a SummarizedExperiment’s assay).

library(Matrix)
library(alabaster.base)
library(S4Vectors)

setMethod("stageObject", "dgTMatrix", function(x, dir, path, child=FALSE) {
    # Create a subdirectory to stash our contents.
    dir.create(file.path(dir, path), showWarnings=FALSE)

    # Create a DataFrame with the triplet data.
    df <- DataFrame(i = x@i, j = x@j, x = x@x)

    # .quickWriteCsv will make sure it's written in an 'alabaster-standard' format.
    outpath <- file.path(path, "foo.csv.gz")
    .quickWriteCsv(df, file.path(dir, outpath), compression="gzip")

    # Specifying the package name in the package attribute of the schema,
    # to ensure that .writeMetadata() can find it for validation.
    schema <- "sparse_triplet_matrix/v1.json"
    attr(schema, "package") <- "alabaster.foo"

    # Formatting the metadata for return.
    list(
        `$schema`=schema,

        # Reported path must be relative to 'dir'.
        path=outpath,

        # Pass along the 'child' specification from the call.
        is_child=child,

        `array`=list(
            # Need I() to prevent unboxing of length-1 vectors.
            dimensions=I(dim(x)),

            # double-precision values => 'number' in JSON schema's language.
            type="number"
        ),

        sparse_triplet_matrix=list(
            num_elements=nrow(df),
            compression="gzip"
        )
    )
})

Alright, let’s test this out with a mock dgTMatrix.

x <- sparseMatrix(
    i=c(1,2,3,5,6), 
    j=c(3,6,1,3,8), 
    x=runif(5), 
    dims=c(10, 10), 
    repr="T"
)
x
## 10 x 10 sparse Matrix of class "dgTMatrix"
##                                                        
##  [1,] .        . 0.4660350 . . .         . .        . .
##  [2,] .        . .         . . 0.7813227 . .        . .
##  [3,] 0.273472 . .         . . .         . .        . .
##  [4,] .        . .         . . .         . .        . .
##  [5,] .        . 0.7579172 . . .         . .        . .
##  [6,] .        . .         . . .         . 0.521723 . .
##  [7,] .        . .         . . .         . .        . .
##  [8,] .        . .         . . .         . .        . .
##  [9,] .        . .         . . .         . .        . .
## [10,] .        . .         . . .         . .        . .
tmp <- tempfile()
dir.create(tmp)
meta <- stageObject(x, tmp, "test")
str(meta)
## List of 5
##  $ $schema              : chr "sparse_triplet_matrix/v1.json"
##   ..- attr(*, "package")= chr "alabaster.foo"
##  $ path                 : chr "test/foo.csv.gz"
##  $ is_child             : logi FALSE
##  $ array                :List of 2
##   ..$ dimensions: 'AsIs' int [1:2] 10 10
##   ..$ type      : chr "number"
##  $ sparse_triplet_matrix:List of 2
##   ..$ num_elements: int 5
##   ..$ compression : chr "gzip"

Running .writeMetadata() will then validate meta against the schema requirements, using the schema file that we put inside alabaster.foo.

4 Writing a loading method

Finally, for loading, we need to define a new load* function for the new class that recreates the R object from the staged files. In this case, it’s fairly simple:

loadSparseTripletMatrix <- function(info, project) {
    # Need to get the file path.
    path <- acquireFile(project, info$path)

    # This utility will check that the CSV is correctly formatted,
    # which is more stringent than read.csv.
    df <- .quickReadCsv(path,
        expected.columns=c(i="integer", j="integer", x="double"),
        expected.nrows=info$sparse_triplet_matrix$num_elements,
        compression=info$sparse_triplet_matrix$compression,
        row.names=FALSE
    )

    # Constructor uses 1-based indices.
    sparseMatrix(
         i=df$i + 1L, 
         j=df$j + 1L, 
         x=df$x, 
         dims=info$array$dimensions, 
         repr="T"
    )
}

Let’s try it out with our previously staged dgTMatrix:

loadSparseTripletMatrix(meta, tmp)
## 10 x 10 sparse Matrix of class "dgTMatrix"
##                                                        
##  [1,] .        . 0.4660350 . . .         . .        . .
##  [2,] .        . .         . . 0.7813227 . .        . .
##  [3,] 0.273472 . .         . . .         . .        . .
##  [4,] .        . .         . . .         . .        . .
##  [5,] .        . 0.7579172 . . .         . .        . .
##  [6,] .        . .         . . .         . 0.521723 . .
##  [7,] .        . .         . . .         . .        . .
##  [8,] .        . .         . . .         . .        . .
##  [9,] .        . .         . . .         . .        . .
## [10,] .        . .         . . .         . .        . .

To register this method with alabaster.base, alabaster.foo should add itself to the known schema locations upon load:

# Typically in zzz.R.
.onLoad <- function(libname, pkgname) {
    existing <- options("alabaster.schema.locations")
    options(alabaster.schema.locations=union("alabaster.foo", existing))
}

.onUnload <- function(libpath) {
    existing <- options("alabaster.schema.locations")
    options(alabaster.schema.locations=setdiff(existing, "alabaster.foo"))
}

This ensures that loadObject() can find the resolved schema file in alabaster.foo’s inst/schemas when it encounters an instance of a sparse_triplet_matrix/v1.json. From the schema, loadObject() will look at the ’s _attributes.restore.R property to figure out how to load it into R, which will eventually lead to calling loadSparseTripletMatrix().

5 Advanced concepts

Some staging methods may need to save child objects, represented by the resource pointers in the schema. For example, a SummarizedExperiment would contain multiple children - each assay, the DataFrames for the column and row annotations, the metadata list, and so on. These are easily handled by just calling the staging methods of each child object inside the staging method for the parent object:

# Abbreviated example from artificer.se:
setMethod("stageObject", "SummarizedExperiment", function(x, dir, path, child=FALSE) {
    dir.create(file.path(dir, path), showWarnings=FALSE)

    # Saving the colData.
    info <- .stageObject(colData(x), dir, file.path(path, "coldata"), child=TRUE)
    cd.info <- list(resource=.writeMetadata(info, dir=dir))

    # Saving the rowData.
    info <- .stageObject(rowData(x), dir, file.path(path, "rowdata"), child=TRUE)
    rd.info <- list(resource=.writeMetadata(info, dir=dir))

    # Saving the other metadata.
    info <- .stageObject(metadata(x), dir, file.path(path, "metadata"), child=TRUE)
    other.info <- list(resource=.writeMetadata(info, dir=dir))

    # Saving the assays.
    assay.info <- list()
    for (a in assayNames(x)) {
        curmat <- assay(x, a)
        mat.path <- file.path(path, paste0("assay-", i))
        meta <- .stageObject(curmat, path=mat.path, dir=dir, child=TRUE)
        deets <- .writeMetadata(meta, dir=dir)
        assay.info <- c(assay.info, list(list(name=ass.names[i], resource=deets)))
    }

    list(
        `$schema`="summarized_experiment/v1.json",
        path=file.path(path, meta.name),
        summarized_experiment=list(
            assays=assay.info,
            column_data=cd.info,
            row_data=rd.info,
            other_data=meta.info,
            dimensions=dim(x)
        ),
        is_child=child
    )
})

When doing so, developers should use .stageObject() instead of calling stageObject() (note the period). The is a wrapper around ensures that the new staging method can respect application-specific overrides. Similarly, some loading methods should use .loadObject() when loading child objects:

# Abbreviated example from artificer.se:
loadSummarizedExperiment <- function(exp.info, project) {
    all.assays <- list()
    for (y in seq_along(exp.info$summarized_experiment$assays)) {
        cur.ass <- exp.info$summarized_experiment$assays[[y]]
        aname <- cur.ass$name
        apath <- cur.ass$resource$path
        ass.info <- acquireMetadata(project, apath)
        all.assays[[aname]] <- .loadObject(ass.info, project=project)
    }

    cd.info <- acquireMetadata(project, exp.info$summarized_experiment$column_data$resource$path)
    cd <- .loadObject(cd.info, project=project)
    rd.info <- acquireMetadata(project, exp.info$summarized_experiment$row_data$resource$path)
    rd <- .loadObject(rd.info, project=project)
    other.info <- acquireMetadata(project, exp.info$summarized_experiment$other_data$resource$path)
    other <- .loadObject(other.info, project=project)

    SummarizedExperiment(all.assays, colData=cd, rowData=rd, metadata=other, checkDimnames=FALSE)
}

Inside the loading functions, we always use acquireMetadata() and acquireFile() to obtain the metadata and file paths, respectively. This ensures that we respect any application-specific overrides, especially for project values that are not file paths.

6 Promoting the schema to alabaster.base

If the new schema refers to a standard Bioconductor object that is widely re-usable, you may consider submitting a PR with the (non-resolved) schema to ArtifactDB/BiocObjectSchemas.

If accepted, we will include your schema into the standard set of schemas known to alabaster.base. Once this is done, alabaster.foo can be simplified considerably:

You may also wish to create a PR to alabaster.base to add your class and package to the staging search path. This will prompt stageObject() to attempt to load your package upon encountering an instance of the class of interest, ensuring that the correct staging method is automatically called even if the user hasn’t loaded your package.

Session information

sessionInfo()
## R version 4.3.0 RC (2023-04-13 r84269)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.2 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.17-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
## [1] S4Vectors_0.38.0     BiocGenerics_0.46.0  Matrix_1.5-4        
## [4] alabaster.base_1.0.0 BiocStyle_2.28.0    
## 
## loaded via a namespace (and not attached):
##  [1] cli_3.6.1               knitr_1.42              rlang_1.1.0            
##  [4] xfun_0.39               DelayedArray_0.26.0     jsonlite_1.8.4         
##  [7] V8_4.3.0                htmltools_0.5.5         MatrixGenerics_1.12.0  
## [10] sass_0.4.5              rmarkdown_2.21          grid_4.3.0             
## [13] evaluate_0.20           jquerylib_0.1.4         fastmap_1.1.1          
## [16] IRanges_2.34.0          yaml_2.3.7              alabaster.schemas_1.0.0
## [19] Rhdf5lib_1.22.0         bookdown_0.33           jsonvalidate_1.3.2     
## [22] BiocManager_1.30.20     compiler_4.3.0          Rcpp_1.0.10            
## [25] rhdf5filters_1.12.0     lattice_0.21-8          rhdf5_2.44.0           
## [28] digest_0.6.31           R6_2.5.1                curl_5.0.0             
## [31] HDF5Array_1.28.0        bslib_0.4.2             tools_4.3.0            
## [34] matrixStats_0.63.0      alabaster.matrix_1.0.0  cachem_1.0.7