TxRegQuery addresses exploration of transcriptional regulatory networks by integrating expression quantitative trait locus (eQTL) data, digital genomic footprinting (DGF) data, DNaseI hypersensitivity (DHS) data, and transcription factor binding site (TFBS) data. Owing to the volume of emerging tissue-specific data, special data modalities are used.
txregnet database
We have a long-running server that will respond to queries. We focus on mongolite as the interface.
suppressPackageStartupMessages({
library(TxRegInfra)
library(mongolite)
library(Gviz)
library(EnsDb.Hsapiens.v75)
library(BiocParallel)
register(SerialParam())
})
con1 = mongo(url=URL_txregInAWS(), db="txregnet")
con1
## <Mongo collection> 'test'
## $aggregate(pipeline = "{}", options = "{\"allowDiskUse\":true}", handler = NULL, pagesize = 1000, iterate = FALSE)
## $count(query = "{}")
## $disconnect(gc = TRUE)
## $distinct(key, query = "{}")
## $drop()
## $export(con = stdout(), bson = FALSE, query = "{}", fields = "{}", sort = "{\"_id\":1}")
## $find(query = "{}", fields = "{\"_id\":0}", sort = "{}", skip = 0, limit = 0, handler = NULL, pagesize = 1000)
## $import(con, bson = FALSE)
## $index(add = NULL, remove = NULL)
## $info()
## $insert(data, pagesize = 1000, stop_on_error = TRUE, ...)
## $iterate(query = "{}", fields = "{\"_id\":0}", sort = "{}", skip = 0, limit = 0)
## $mapreduce(map, reduce, query = "{}", sort = "{}", limit = 0, out = NULL, scope = NULL)
## $remove(query, just_one = FALSE)
## $rename(name, db = NULL)
## $replace(query, update = "{}", upsert = FALSE)
## $run(command = "{\"ping\": 1}", simplify = TRUE)
## $update(query, update = "{\"$set\":{}}", filters = NULL, upsert = FALSE, multiple = FALSE)
We will write methods that work with the ‘fields’ of this object.
There is not much explicit reflection in the mongolite API. The following is improvised and may be fragile:
## $name
## [1] "test"
##
## $db
## [1] "txregnet"
##
## $url
## [1] "mongodb+srv://user:user123@cluster1-ag7nd.mongodb.net/test"
##
## $options
## List of 6
## $ pem_file : NULL
## $ ca_file : NULL
## $ ca_dir : NULL
## $ crl_file : NULL
## $ allow_invalid_hostname: logi FALSE
## $ weak_cert_validation : logi FALSE
If the mongo utility is available as a system command, we can get a list of collections in the database as follows.
Otherwise, as long as mongolite is installed and we know the collection names of interest, we can use them as noted throughout this vignette.
We can get a record from a given collection:
mongo(url=URL_txregInAWS(), db="txregnet",
collection="Adipose_Subcutaneous_allpairs_v7_eQTL")$find(limit=1)
## gene_id variant_id tss_distance ma_samples ma_count
## 1 ENSG00000233750.3 1_743420_G_A_b37 612395 28 30
## maf pval_nominal slope slope_se qvalue chr snp_pos A1 A2 build
## 1 0.0391645 0.000143004 0.559146 0.145246 0.0130483 1 743420 G A b37
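Because `$find` takes a `query` argument holding a JSON string (see the method listing above), a filter can be assembled from an R list rather than typed by hand. The following is a minimal sketch under stated assumptions: it uses jsonlite, which is not loaded elsewhere in this vignette, the helper name `make_find_query` is hypothetical, and the query has not been run against the server here.

```r
library(jsonlite)

# Hypothetical helper (not part of TxRegInfra or mongolite): turn named
# field = value pairs into a JSON string for the `query` argument of $find().
make_find_query <- function(...) {
  # auto_unbox = TRUE writes length-one vectors as scalars, not arrays
  as.character(toJSON(list(...), auto_unbox = TRUE))
}

# Filter on the gene identifier seen in the record above
q <- make_find_query(gene_id = "ENSG00000233750.3")
q
```

A string like `q` could then be supplied as `$find(query = q, limit = 5)` on a collection object.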
Queries can be composed using JSON. We have a tool to generate queries that employ the MongoDB aggregation method. Here we demonstrate this by computing, for each chromosome, the count and the minimum value of the footprint statistic on CD14 cells.
m1 = mongo(url = URL_txregInAWS(), db = "txregnet", collection="CD14_DS17215_hg19_FP")
newagg = makeAggregator( by="chr", vbl="stat", op="$min", opname="min")
The JSON layout of this aggregating query is
[
{
"$group": {
"_id": ["$chr"],
"count": {
"$sum": [1]
},
"min": {
"$min": ["$stat"]
}
}
}
]
Invocation returns a data frame:
## _id count min
## 1 chrY 827 0.01907390
## 2 chr18 15868 0.06107950
## 3 chr10 40267 0.00601357
## 4 chr4 32947 0.02776440
## 5 chr6 54728 0.00565057
## 6 chr17 47987 0.01242310
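The aggregation layout shown above can also be assembled directly from R lists. The sketch below is not the implementation of `makeAggregator`; the function name `make_group_pipeline` is hypothetical, and jsonlite is an added dependency. It writes scalars where the layout above has one-element arrays, which we assume to be equivalent for this purpose.

```r
library(jsonlite)

# Hypothetical helper: build a one-stage $group pipeline that counts
# documents per level of `by` and applies `op` to the field `vbl`.
make_group_pipeline <- function(by, vbl, op = "$min", opname = "min") {
  group <- list(`_id` = paste0("$", by),      # grouping key, e.g. "$chr"
                count = list(`$sum` = 1))     # one per document
  group[[opname]] <- setNames(list(paste0("$", vbl)), op)
  # The outer unnamed list becomes the JSON array of pipeline stages
  toJSON(list(list(`$group` = group)), auto_unbox = TRUE)
}

pipeline <- make_group_pipeline(by = "chr", vbl = "stat")
prettify(pipeline)
```

The resulting string has the form expected by the `$aggregate(pipeline = ...)` method listed earlier.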
We need to bind the metadata to information about the MongoDB collections.
The following turns a very ad hoc filtering of the collection names into a DataFrame.
## DataFrame with 2 rows and 3 columns
## base type
## <character> <character>
## Adipose_Subcutaneous_allpairs_v7_eQTL Adipose eQTL
## CD14_DS17215_hg19_FP CD14 FP
## mid
## <character>
## Adipose_Subcutaneous_allpairs_v7_eQTL Subcutaneous_allpairs_v7
## CD14_DS17215_hg19_FP DS17215_hg19
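The columns above are consistent with splitting each collection name on underscores: first token as `base`, last token as `type`, and everything between as `mid`. The following sketch reproduces that layout under this assumption. The function name `parse_collection_names` is hypothetical, and a base `data.frame` stands in for the `DataFrame` class to keep the example free of extra dependencies.

```r
# Hypothetical helper: derive base/type/mid from underscore-delimited
# collection names (an assumed convention, inferred from the table above).
parse_collection_names <- function(cnames) {
  parts <- strsplit(cnames, "_", fixed = TRUE)
  data.frame(
    base = vapply(parts, function(p) p[1], character(1)),
    type = vapply(parts, function(p) p[length(p)], character(1)),
    # middle tokens, rejoined; empty string when there are none
    mid  = vapply(parts, function(p)
             paste(p[-c(1, length(p))], collapse = "_"), character(1)),
    row.names = cnames,
    stringsAsFactors = FALSE
  )
}

meta <- parse_collection_names(c("Adipose_Subcutaneous_allpairs_v7_eQTL",
                                 "CD14_DS17215_hg19_FP"))
meta
```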
A key method in development is subsetting the archive by genomic coordinates.
## ..........................................
## class: RaggedExperiment
## dim: 1555 41
## assays(3): chr id stat
## rownames: NULL
## colnames(41): CD14_DS17215_hg19_FP CD19_DS17186_hg19_FP ...
## iPS_19_11_DS15153_hg19_FP vHMEC_DS18406_hg19_FP
## colData names(3): base type mid
## [1] 1555 41
## fLung_DS14724_hg19_FP fMuscle_arm_DS17765_hg19_FP
## chr17:38083984-38083990 NA 0.884509
## chr17:38084524-38084551 NA 0.671296
## chr17:38084552-38084571 NA 0.667570
## chr17:38083560-38083566 NA NA
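# Gene model track for ORMDL3
ormm = txmodels("ORMDL3", plot=FALSE, name="ORMDL3")
# sa is assumed to be the matrix of footprint statistics shown above,
# with rownames of the form chr:start-end
sar = strsplit(rownames(sa), ":|-")
an = as.numeric
gr = GRanges(seqnames(ormm)[1], IRanges(an(sapply(sar,"[", 2)), an(sapply(sar,"[", 3))))
# One scored copy of the ranges per sample; the plotted score is 1 - stat
gr1 = gr
gr1$score = 1-sa[,1]
gr2 = gr
gr2$score = 1-sa[,2]
sc1 = DataTrack(gr1, name="Lung FP")
sc2 = DataTrack(gr2, name="Musc/Arm FP")
# Draw the axis, the two footprint tracks, and the gene model together
plotTracks(list(GenomeAxisTrack(), sc1, sc2, ormm), showId=TRUE)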
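The coordinate parsing in the code above relies on row names of the form `chr:start-end`. The sketch below isolates that step in base R so it can be checked without a database connection or Bioconductor packages. The function name `parse_range_labels` is hypothetical, and the labels are copied from the output shown earlier. Note that the `":|-"` split assumes sequence names contain neither `:` nor `-`.

```r
# Hypothetical helper: split "chr:start-end" labels into their components.
parse_range_labels <- function(labels) {
  parts <- strsplit(labels, ":|-")
  data.frame(
    seqnames = vapply(parts, `[`, character(1), 1),
    start    = as.numeric(vapply(parts, `[`, character(1), 2)),
    end      = as.numeric(vapply(parts, `[`, character(1), 3)),
    stringsAsFactors = FALSE
  )
}

labels <- c("chr17:38083984-38083990", "chr17:38084524-38084551",
            "chr17:38084552-38084571", "chr17:38083560-38083566")
rng <- parse_range_labels(labels)
# Width of each range, counting both endpoints
rng$width <- rng$end - rng$start + 1
rng
```

The `start` and `end` columns correspond to the two arguments passed to `IRanges` in the plotting code.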