This vignette demonstrates how to manually collect GO gene sets from BioMart. The source code for retrieving GO gene sets for all supported organisms on BioMart can be found from the following location:
## [1] "/tmp/Rtmp87D2sD/Rinst11e3823173dad/BioMartGOGeneSets/scripts/biomart_genesets.R"
Retrieving GO gene sets from BioMart is a two-step process.
The relations can be obtained from BioMart with the R package biomaRt. First let’s create a “Mart” object which is like a connection to the BioMart web-service. Because this is a gene-related relation, we set argument biomart
to "genes"
.
## Object of class 'Mart':
## Using the ENSEMBL_MART_ENSEMBL BioMart database
## No dataset selected.
Please note, if later when you use biomaRt and see errors of “timeout”, you can select a different mirror, such as:
Next we need to select an organism. In BioMart, it is called a “dataset”. To get a proper value of an organism dataset, we can use listDatasets()
to get a list of supported datasets.
## [1] 214 3
## dataset description
## 1 abrachyrhynchus_gene_ensembl Pink-footed goose genes (ASM259213v1)
## 2 acalliptera_gene_ensembl Eastern happy genes (fAstCal1.2)
## 3 acarolinensis_gene_ensembl Green anole genes (AnoCar2.0v2)
## 4 acchrysaetos_gene_ensembl Golden eagle genes (bAquChr1.2)
## 5 acitrinellus_gene_ensembl Midas cichlid genes (Midas_v5)
## 6 amelanoleuca_gene_ensembl Giant panda genes (ASM200744v2)
## version
## 1 ASM259213v1
## 2 fAstCal1.2
## 3 AnoCar2.0v2
## 4 bAquChr1.2
## 5 Midas_v5
## 6 ASM200744v2
You can see there are a huge number of organisms supported in BioMart. The first column in datasets
contains valid values for dataset
which will be used in later functions.
Sometimes it is not easy to find the dataset of your organism, especially when the organism is not a model organism. The second column in datasets
contains dataset descriptions, which sometimes can be helpful to find the right one.
In this example, we use the dataset for human "hsapiens_gene_ensembl"
.
To use the dataset, we specify dataset
in the function useDateset()
, which is like adding a flag of which dataset to use.
## Object of class 'Mart':
## Using the ENSEMBL_MART_ENSEMBL BioMart database
## Using the hsapiens_gene_ensembl dataset
The dataset is like a giant table with a huge number of columns which provide massive additional information for genes. Here we are only interested in GO-related information. In the dataset, the table columns are called “attributes”. There are a huge number of supported attributes in a dataset. The complete list of attributes can be obtained by the function listAttributes()
.
## [1] 3169 3
## name description page
## 1 ensembl_gene_id Gene stable ID feature_page
## 2 ensembl_gene_id_version Gene stable ID version feature_page
## 3 ensembl_transcript_id Transcript stable ID feature_page
## 4 ensembl_transcript_id_version Transcript stable ID version feature_page
## 5 ensembl_peptide_id Protein stable ID feature_page
## 6 ensembl_peptide_id_version Protein stable ID version feature_page
To get proper values for the attributes, we need to go through the long table and sometimes this is not an easy task. The three attributes of GO-gene relations are c("ensembl_gene_id", "go_id", "namespace_1003")
. Now we can use the function getBM()
to obtain the GO-gene relation table.
Check the first several rows in go
:
## ensembl_gene_id go_id namespace_1003
## 1 ENSG00000210049 GO:0030533 molecular_function
## 2 ENSG00000210049 GO:0006412 biological_process
## 3 ENSG00000211459
## 4 ENSG00000210077
## 5 ENSG00000210082 GO:0003735 molecular_function
## 6 ENSG00000210082 GO:0005840 cellular_component
For some organisms, the returned table might be huge, or due to the bad internet connection, the data retrieving may exceed the 5 min limit on BioMart, and you might see the following error:
Error in curl::curl_fetch_memory(url, handle = handle) :
Timeout was reached: [uswest.ensembl.org:443] Operation timed out after 300005 milliseconds with 2966434 bytes received
In this case, you might need to split the query into blocks and make sure each job query is small.
genes = getBM(attributes = "ensembl_gene_id", mart = ensembl)[, 1]
go1 = getBM(attributes = at, mart = ensembl, filter = "ensembl_gene_id", value = genes[1:1000])
go2 = getBM(attributes = at, mart = ensembl, filter = "ensembl_gene_id", value = genes[1001:2000])
...
rbind(go1, go2, ...)
In this example, we will only demonstrate the Biological Process Ontology, and we convert the data frame go
to a list of genes.
go = go[go$namespace_1003 == "biological_process", , drop = FALSE]
gs = split(go$ensembl_gene_id, go$go_id)
In gs
, there are a list of vectors where each vector can be thought as a gene set for the corresponding GO term.
GO has a tree structure where a parent term includes all genes annotated to its child terms. For every GO terms in gs
, we need to merge genes from all its child or offspring terms to form a complete gene set for this GO term.
The hierarchical structure of GO terms is stored in the GO.db package. In the following code, bp_terms
contains a vector of GO terms only in the Biological Process ontology. The variable GOBPOFFSPRING
is simply a list where each element vector contains all offspring terms (child and remote downstream terms) of a GO term.
library(GO.db)
bp_terms = GOID(GOTERM)[Ontology(GOTERM) == "BP"]
GOBPOFFSPRING = as.list(GOBPOFFSPRING)
Now it is quite easy to merge genes from offspring terms. Just note as the final step, empty GO gene sets should be removed.
## R version 4.3.1 (2023-06-16)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.3 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.17-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] GO.db_3.17.0 AnnotationDbi_1.62.2 IRanges_2.34.1
## [4] S4Vectors_0.38.1 Biobase_2.60.0 BiocGenerics_0.46.0
## [7] biomaRt_2.56.1 knitr_1.43
##
## loaded via a namespace (and not attached):
## [1] KEGGREST_1.40.0 xfun_0.40 bslib_0.5.1
## [4] vctrs_0.6.3 tools_4.3.1 bitops_1.0-7
## [7] generics_0.1.3 curl_5.0.2 tibble_3.2.1
## [10] fansi_1.0.4 RSQLite_2.3.1 blob_1.2.4
## [13] pkgconfig_2.0.3 dbplyr_2.3.3 lifecycle_1.0.3
## [16] GenomeInfoDbData_1.2.10 compiler_4.3.1 stringr_1.5.0
## [19] Biostrings_2.68.1 progress_1.2.2 GenomeInfoDb_1.36.2
## [22] htmltools_0.5.6 sass_0.4.7 RCurl_1.98-1.12
## [25] yaml_2.3.7 pillar_1.9.0 crayon_1.5.2
## [28] jquerylib_0.1.4 cachem_1.0.8 tidyselect_1.2.0
## [31] digest_0.6.33 stringi_1.7.12 purrr_1.0.2
## [34] dplyr_1.1.3 fastmap_1.1.1 cli_3.6.1
## [37] magrittr_2.0.3 XML_3.99-0.14 utf8_1.2.3
## [40] withr_2.5.0 prettyunits_1.1.1 filelock_1.0.2
## [43] rappdirs_0.3.3 bit64_4.0.5 rmarkdown_2.24
## [46] XVector_0.40.0 httr_1.4.7 bit_4.0.5
## [49] png_0.1-8 hms_1.1.3 memoise_2.0.1
## [52] evaluate_0.21 BiocFileCache_2.8.0 rlang_1.1.1
## [55] glue_1.6.2 DBI_1.1.3 xml2_1.3.5
## [58] jsonlite_1.8.7 R6_2.5.1 zlibbioc_1.46.0