beachmat 1.2.1
beachmat has a few useful utilities outside of the C++ API. This document describes how to use them.
Given the dimensions of a matrix, users can choose HDF5 chunk dimensions that give fast performance for both row- and column-level access.
library(beachmat)
nrows <- 10000
ncols <- 200
getBestChunkDims(c(nrows, ncols))
## [1] 708 15
In the future, it should be possible to feed this back into the API.
Currently, if chunk dimensions are not specified in the C++ code, the API will retrieve them from R via the getHDF5DumpChunkDim() function from HDF5Array.
The aim is to also provide a setHDF5DumpChunkDim() function so that any chunk dimension specified in R will be respected.
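To see why chunk dimensions like those above are a reasonable compromise, it can help to count how many chunks must be read to extract a single row or a single column. The sketch below uses only base R. The chunksTouched() helper is hypothetical (it is not part of beachmat or HDF5Array), and it assumes that access cost is roughly proportional to the number of chunks touched, which ignores caching and compression.

```r
# Hypothetical helper, not part of beachmat: counts the chunks that overlap
# a single row or a single column, assuming a regular grid of chunks.
chunksTouched <- function(dims, chunk.dims) {
    n.chunks <- ceiling(dims / chunk.dims)  # chunks along rows, along columns
    # A row crosses every chunk along the column dimension, and vice versa.
    c(per.row = n.chunks[2], per.column = n.chunks[1])
}

dims <- c(10000, 200)
chunksTouched(dims, c(708, 15))   # per.row = 14, per.column = 15
chunksTouched(dims, c(1, 200))    # per.row = 1, per.column = 10000
chunksTouched(dims, c(10000, 1))  # per.row = 200, per.column = 1
```

Under this simplified model, the 708 by 15 chunks need a similar number of chunk reads for either access pattern, whereas purely row- or column-based chunks are ideal for one pattern and very costly for the other.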
The most common access patterns for matrices (at least, for high-throughput biological data) are by row or by column.
The rechunkByMargins() function takes an HDF5-backed matrix and converts it to use purely row- or column-based chunks.
library(HDF5Array)
A <- as(matrix(runif(5000), nrow=100, ncol=50), "HDF5Array")
byrow <- rechunkByMargins(A, byrow=TRUE)
byrow
## <100 x 50> HDF5Matrix object of type "double":
## [,1] [,2] [,3] ... [,49] [,50]
## [1,] 0.39654122 0.90981295 0.15698185 . 0.07715176 0.38199158
## [2,] 0.50387248 0.16403042 0.97257957 . 0.24375629 0.47965440
## [3,] 0.84990580 0.14099461 0.59938974 . 0.67124502 0.68904374
## [4,] 0.08297725 0.01458883 0.48380435 . 0.98778347 0.80442242
## [5,] 0.32825388 0.37359071 0.35928572 . 0.74132208 0.80409870
## ... . . . . . .
## [96,] 0.528873177 0.008692131 0.060870992 . 0.65175399 0.64597448
## [97,] 0.322905799 0.589599811 0.322770989 . 0.08191497 0.64729879
## [98,] 0.614903448 0.103008477 0.465179285 . 0.99478822 0.96743649
## [99,] 0.754406212 0.661506093 0.609382283 . 0.29000714 0.01524290
## [100,] 0.221138066 0.186687433 0.847517195 . 0.99814249 0.28305125
bycol <- rechunkByMargins(A, byrow=FALSE)
bycol
## <100 x 50> HDF5Matrix object of type "double":
## [,1] [,2] [,3] ... [,49] [,50]
## [1,] 0.39654122 0.90981295 0.15698185 . 0.07715176 0.38199158
## [2,] 0.50387248 0.16403042 0.97257957 . 0.24375629 0.47965440
## [3,] 0.84990580 0.14099461 0.59938974 . 0.67124502 0.68904374
## [4,] 0.08297725 0.01458883 0.48380435 . 0.98778347 0.80442242
## [5,] 0.32825388 0.37359071 0.35928572 . 0.74132208 0.80409870
## ... . . . . . .
## [96,] 0.528873177 0.008692131 0.060870992 . 0.65175399 0.64597448
## [97,] 0.322905799 0.589599811 0.322770989 . 0.08191497 0.64729879
## [98,] 0.614903448 0.103008477 0.465179285 . 0.99478822 0.96743649
## [99,] 0.754406212 0.661506093 0.609382283 . 0.29000714 0.01524290
## [100,] 0.221138066 0.186687433 0.847517195 . 0.99814249 0.28305125
Rechunking can provide a substantial speed-up to downstream functions, especially those requiring access to random columns or rows.
Indeed, the time saved in those functions often offsets the time spent in constructing a new HDF5Matrix.
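One rough way to check whether rechunking pays off for a given workload is to time the same extraction on both layouts. The sketch below reuses the rechunkByMargins() calls shown above on a larger matrix. The matrix size and the number of sampled columns are arbitrary choices for illustration, and actual timings will depend on the machine, the file system and the HDF5 cache.

```r
library(beachmat)
library(HDF5Array)

# Same construction as above, with more rows so that timings are measurable.
A <- as(matrix(runif(500000), nrow=10000, ncol=50), "HDF5Array")
byrow <- rechunkByMargins(A, byrow=TRUE)
bycol <- rechunkByMargins(A, byrow=FALSE)

# Extract the same random subset of columns from each layout.
cols <- sample(ncol(A), 10)
system.time(from.row <- as.matrix(byrow[, cols, drop=FALSE]))
system.time(from.col <- as.matrix(bycol[, cols, drop=FALSE]))

# The chunk layout should only affect speed, not the values returned.
stopifnot(identical(from.row, from.col))
```

If column access dominates downstream, the column-based layout would be expected to be the faster of the two here; repeating the comparison with random rows should show the opposite.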
sessionInfo()
## R version 3.5.0 (2018-04-23)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.4 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.7-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.7-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] HDF5Array_1.8.0 rhdf5_2.24.0 DelayedArray_0.6.0
## [4] BiocParallel_1.14.1 IRanges_2.14.10 S4Vectors_0.18.2
## [7] BiocGenerics_0.26.0 matrixStats_0.53.1 beachmat_1.2.1
## [10] knitr_1.20 BiocStyle_2.8.1
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.17 magrittr_1.5 stringr_1.3.1 tools_3.5.0
## [5] xfun_0.1 htmltools_0.3.6 yaml_2.1.19 rprojroot_1.3-2
## [9] digest_0.6.15 bookdown_0.7 Rhdf5lib_1.2.1 evaluate_0.10.1
## [13] rmarkdown_1.9 stringi_1.2.2 compiler_3.5.0 backports_1.1.2