Traditional analyses of gene expression data in disease studies usually compare disease samples with normal controls and report the most differentially expressed genes. This approach alone makes it hard to discover a disease's biomarkers and mechanisms. To enable quantitative comparison of complex diseases, we implemented PFP, a pathway-based characterization of an individual's disease, on the open scientific computing platform R. The package introduces a pathway fingerprint (PFP) method that evaluates the importance of a gene set in different pathways, helping researchers focus on the most relevant pathways and genes. The resulting fingerprints can be overlaid to visually compare and analyze different diseases. We collected three types of gene expression data, performed enrichment analysis on KEGG pathways, and compared the results with those of other methods. The results indicate that Pathway Fingerprint performed better than the other enrichment tools we compared: it picked out the most relevant pathways and remained stable when the data changed. In summary, we propose a novel, general, and systematic method, Pathway Fingerprint, that uses pathway topology to help researchers focus on the key pathways and genes.
The three main features of PFP:
PFP requires these packages: graph, igraph, KEGGgraph, clusterProfiler, ggplot2, plyr, tidyr, magrittr, stats, methods and utils. Some of these packages are only available from Bioconductor. You can also install the latest development version from GitHub, which requires the devtools package; if it is not already on your system, install it with install.packages("devtools"). Note that devtools sometimes needs extra non-R software on your system: Rtools on Windows or Xcode on OS X. More information is available in the devtools documentation. You can install PFP via Bioconductor.
## install PFP from Bioconductor
## (installs BiocManager first if it is missing)
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("PFP")
You can also install PFP via GitHub.
## install PFP from GitHub; requires Bioconductor
## dependency packages to be pre-installed
if (!require(devtools)) install.packages("devtools")
devtools::install_github("aib-group/PFP")
The analysis also requires org.Hs.eg.db, which can be installed as follows.
## install org.Hs.eg.db from Bioconductor
## (installs BiocManager first if it is missing)
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("org.Hs.eg.db")
After installation, PFP can be loaded into the current workspace by typing or pasting the following code:
library(PFP)
In our method, we use KEGG (http://www.kegg.jp/) pathway networks as the reference for generating a Pathway Fingerprint. KEGG provides KGML files for its pathways, which enable automatic drawing of KEGG pathways and support computational analysis and modeling of gene/protein networks and chemical networks. We downloaded the latest (2020.11.8) KGML files of all human pathways in KEGG and converted them to networks, which yielded a total of 338 pathway networks for further analysis.
Different methods are used to identify differentially expressed genes (DEGs) in the three types of data. We processed the microarray data with limma. We also selected cancer samples of the same cancer type and compared them with the control group using edgeR. In both limma and edgeR, we kept only the genes whose log2 fold change (logFC) was greater than 1 and whose false discovery rate (FDR) was less than 0.05.
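The thresholds above can be applied with a simple filter. The following is a minimal sketch on a made-up results table, not the vignette's actual pipeline. The column names logFC and adj.P.Val follow the output of limma::topTable(); edgeR::topTags() reports the adjusted p-value in a column named FDR instead. The gene names and values are hypothetical.

```r
# Toy differential-expression table; gene names and values are made up.
# Column names follow limma::topTable() output (logFC, adj.P.Val).
de_table <- data.frame(
  gene      = c("geneA", "geneB", "geneC", "geneD"),
  logFC     = c(2.3, 0.4, 1.6, -1.8),
  adj.P.Val = c(0.001, 0.200, 0.030, 0.010),
  stringsAsFactors = FALSE
)

# Keep genes with logFC > 1 and FDR < 0.05, as stated in the text.
# Use abs(logFC) > 1 instead if down-regulated genes should be kept too.
degs <- de_table[de_table$logFC > 1 & de_table$adj.P.Val < 0.05, ]
deg_genes <- degs$gene
print(deg_genes)
```

Whether the original analysis filtered on the signed or the absolute logFC is not stated here, so the sketch follows the text literally.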
We defined a new S4 class, PFP, to store the score. PFP also provides six major methods for this S4 class:
genes_score(): Gene scores, optionally for a selected group or pathway.
sub_PFP(): Select a portion of the PFP by group, slice, pathway name, or ID.
show(): Display the network group name, group size, and PFP score for each pathway.
plot_PFP(): Display the PFP fingerprint.
refnet_names(): Extract the names of the base (reference) network groups.
rank_PFP(): Rank the pathways by weight, ordering first by p-value and then by PFP score.
For detailed instructions on these methods, refer to the package function help.
We also defined a new S4 class, PFPRefnet, to store the KEGG reference pathway network information. It provides six methods:
network(): The KEGG reference pathway networks.
net_info(): Pathway information.
group(): Group information.
refnet_names(): The names of the reference networks.
subnet(): Select a portion of the PFPRefnet by group, slice, pathway name, or ID.
show(): Show the number of pathways in each group of the reference network.
The PFP can then be calculated as follows:
# load the data -- gene list of human; the PFPRefnet object
# of human; the PFP object to test; the list of different
# genes.
data("gene_list_hsa")
data("PFPRefnet_hsa")
data("PFP_test1")
data("data_std")
# Step1: calculate the similarity score of network.
PFP_test <- calc_PFP_score(genes = gene_list_hsa, PFPRefnet = PFPRefnet_hsa)
# Step2: rank the pathway by the PFP score.
rank1 <- rank_PFP(object = PFP_test, total_rank = TRUE, thresh_value = 0.5)
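rank_PFP() is described as ranking pathways by p-value first and then by PFP score. How the function does this internally, and what total_rank and thresh_value control, is not shown in this section. The base R sketch below only illustrates the general idea of a two-key ordering on a hypothetical table; the column names p_value and score and all values are assumptions.

```r
# Toy pathway table; names, p-values and scores are made up.
toy_scores <- data.frame(
  pathway = c("pathway_1", "pathway_2", "pathway_3", "pathway_4"),
  p_value = c(0.04, 0.01, 0.01, 0.20),
  score   = c(0.9, 0.3, 0.7, 0.8),
  stringsAsFactors = FALSE
)

# Sort by ascending p-value, breaking ties by descending score.
toy_ranked <- toy_scores[order(toy_scores$p_value, -toy_scores$score), ]
rownames(toy_ranked) <- NULL
print(toy_ranked)
```

With this ordering, a pathway with a high score but a weak p-value (pathway_4 here) still ranks below pathways with stronger statistical support.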
Next, we study the target pathway, that is, the pathway with the highest score after ranking. Below is a simple example.
# Step1: select the pathway with the highest score.
pathway_select <- refnet_info(rank1)[1, "id"]
gene_test <- pathways_score(rank1)$genes_score[[pathway_select]]$ENTREZID
# Step2: get the correlation coefficient score of the edges.
edges_coexp <- get_exp_cor_edges(gene_test, data_std)
# Step3: find the differentially expressed genes of interest.
gene_list2 <- unique(c(edges_coexp$source, edges_coexp$target))
# Step4: find the KEGG edges related to those genes.
edges_kegg <- get_bg_related_kegg(gene_list2, PFPRefnet = PFPRefnet_hsa,
                                  rm_duplicated = TRUE)
# Step5: find the associated network.
require(org.Hs.eg.db)
net_test <- get_asso_net(edges_coexp = edges_coexp, edges_kegg = edges_kegg,
                         if_symbol = TRUE, gene_info_db = org.Hs.eg.db)
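The example above reads edges_coexp$source and edges_coexp$target, so the co-expression result carries at least those two columns. As a rough illustration of what such an edge table can look like, the base R sketch below builds one from a toy expression matrix using Pearson correlation. This is not how get_exp_cor_edges() works internally; the matrix, the coexp column name and the 0.9 cutoff are all assumptions made for the illustration.

```r
set.seed(1)
# Toy expression matrix: 4 genes (rows) x 6 samples (columns); values are random.
expr <- matrix(rnorm(24), nrow = 4,
               dimnames = list(paste0("g", 1:4), paste0("s", 1:6)))
# Make g2 track g1 closely so at least one strong edge exists.
expr["g2", ] <- expr["g1", ] + rnorm(6, sd = 0.05)

# Gene-by-gene Pearson correlation (cor() works on columns, hence t()).
cor_mat <- cor(t(expr))

# Convert the upper triangle into an edge list with source/target columns.
idx <- which(upper.tri(cor_mat), arr.ind = TRUE)
toy_edges <- data.frame(
  source = rownames(cor_mat)[idx[, "row"]],
  target = colnames(cor_mat)[idx[, "col"]],
  coexp  = cor_mat[idx],
  stringsAsFactors = FALSE
)

# Keep only strongly correlated pairs; the 0.9 cutoff is arbitrary.
strong_edges <- toy_edges[abs(toy_edges$coexp) > 0.9, ]
print(strong_edges)
```

Taking only the upper triangle avoids listing each gene pair twice and drops the trivial self-correlations on the diagonal.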
PFP provides the plot_PFP() function to visualize the network fingerprint of a single query network. First, we show an example of the PFP score.
plot_PFP(PFP_test)
Plot the scores from high to low.
plot_PFP(rank1)
The versions of R and of the packages loaded when generating the vignette were:
#> R Under development (unstable) (2022-10-25 r83175)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 22.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.17-bioc/R/lib/libRblas.so
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] org.Hs.eg.db_3.16.0 AnnotationDbi_1.61.0 IRanges_2.33.0
#> [4] S4Vectors_0.37.0 Biobase_2.59.0 BiocGenerics_0.45.0
#> [7] PFP_1.7.0
#>
#> loaded via a namespace (and not attached):
#> [1] DBI_1.1.3 bitops_1.0-7 gson_0.0.9
#> [4] shadowtext_0.1.2 gridExtra_2.3 formatR_1.12
#> [7] rlang_1.0.6 magrittr_2.0.3 DOSE_3.25.0
#> [10] compiler_4.3.0 RSQLite_2.2.18 png_0.1-7
#> [13] vctrs_0.5.0 reshape2_1.4.4 stringr_1.4.1
#> [16] pkgconfig_2.0.3 crayon_1.5.2 fastmap_1.1.0
#> [19] XVector_0.39.0 labeling_0.4.2 ggraph_2.1.0
#> [22] utf8_1.2.2 HDO.db_0.99.1 rmarkdown_2.17
#> [25] enrichplot_1.19.0 graph_1.77.0 KEGGgraph_1.59.0
#> [28] purrr_0.3.5 bit_4.0.4 xfun_0.34
#> [31] zlibbioc_1.45.0 cachem_1.0.6 aplot_0.1.8
#> [34] GenomeInfoDb_1.35.0 jsonlite_1.8.3 blob_1.2.3
#> [37] highr_0.9 BiocParallel_1.33.0 tweenr_2.0.2
#> [40] parallel_4.3.0 R6_2.5.1 RColorBrewer_1.1-3
#> [43] bslib_0.4.0 stringi_1.7.8 jquerylib_0.1.4
#> [46] GOSemSim_2.25.0 Rcpp_1.0.9 assertthat_0.2.1
#> [49] knitr_1.40 downloader_0.4 Matrix_1.5-1
#> [52] splines_4.3.0 igraph_1.3.5 tidyselect_1.2.0
#> [55] viridis_0.6.2 qvalue_2.31.0 yaml_2.3.6
#> [58] codetools_0.2-18 lattice_0.20-45 tibble_3.1.8
#> [61] plyr_1.8.7 treeio_1.23.0 withr_2.5.0
#> [64] KEGGREST_1.39.0 evaluate_0.17 gridGraphics_0.5-1
#> [67] scatterpie_0.1.8 polyclip_1.10-4 Biostrings_2.67.0
#> [70] ggtree_3.7.0 pillar_1.8.1 clusterProfiler_4.7.0
#> [73] ggfun_0.0.7 generics_0.1.3 RCurl_1.98-1.9
#> [76] ggplot2_3.3.6 tidytree_0.4.1 munsell_0.5.0
#> [79] scales_1.2.1 glue_1.6.2 lazyeval_0.2.2
#> [82] tools_4.3.0 data.table_1.14.4 fgsea_1.25.0
#> [85] graphlayouts_0.8.3 XML_3.99-0.12 fastmatch_1.1-3
#> [88] tidygraph_1.2.2 cowplot_1.1.1 grid_4.3.0
#> [91] ape_5.6-2 tidyr_1.2.1 colorspace_2.0-3
#> [94] nlme_3.1-160 patchwork_1.1.2 GenomeInfoDbData_1.2.9
#> [97] ggforce_0.4.1 cli_3.4.1 fansi_1.0.3
#> [100] viridisLite_0.4.1 dplyr_1.0.10 Rgraphviz_2.43.0
#> [103] gtable_0.3.1 yulab.utils_0.0.5 sass_0.4.2
#> [106] digest_0.6.30 ggplotify_0.1.0 ggrepel_0.9.1
#> [109] farver_2.1.1 memoise_2.0.1 htmltools_0.5.3
#> [112] lifecycle_1.0.3 httr_1.4.4 GO.db_3.16.0
#> [115] bit64_4.0.5 MASS_7.3-58.1