OMICsPCA is a statistical framework designed to integrate and analyse multiple types of heterogeneous assays, factors, data types or modalities (e.g. ChIP-seq, RNA-seq) from different samples or sources ( e.g. multiple cell lines, time points etc.). Biological processes in eukaryotes are complicated molecular interactions, spanned over multiple layers of regulation. In order to understand the collective effects of the interactions of biological processes, they should be studied parallelly. High throughput sequencing has enabled genome-wide assays across several disease conditions, cell types, time points etc. at a much faster rate. On the other hand, replicated experiments on diverse samples (e.g. various cell lines) help us to understand the origin of variation (e.g. the genes, TSSs, exons etc) observed between the samples. Therefore, integrative multi-omics studies on heterogeneous samples are gaining their importance every day. As a consequence, a huge number of datasets are being produced and stored in public databases, allowing researchers to perform integrative analysis of multiple data modalities.
Unfortunately, most of such studies are limited to very few samples(e.g. cell lines) or data modalities (e.g. various ChIP-seq assays) due to the huge cost, ethical issues, and scarcity of samples associated with the entire experimental process. And, most of the data integration methods rely on the condition of homogeneity of samples, i.e. the samples corresponding to each modality should be the same. As a result, integrative analysis on publicly available data sets is very much rare and extremely restricted to very few numbers of samples or modalities (or both).
Let’s illustrate the problem with an example. Suppose our objective is to decipher the collective effect of CpG methylation and H3k4me3 histone modification on the gene expression, followed by clustering of the similar genes on the basis of the consolidated effect of these 2 data modalities. Consider we have the following datasets:
CpG methylation on 5 Cell lines A, B, C, D, and E
H3k4me3 histone modification on 10 Cell lines A,D,E,I,K,Q,P,Z,W and G.
Here, we can not calculate the collective effect directly, since the cell lines and the number of cell lines are different in the two experiments.
Several data integration frameworks have been designed over the past few years to overcome such difficulty. For example, Huang (1) proposed a regression-based joint modeling approach to integrate SNPs, DNA methylation, and gene expression; He et al. (2)developed a pattern matching method to integrated eQTL and GWAS data; Xiong et al. (3) developed a statistical framework to associate SNP and gene expression information etc. All such applications are designed to perform in a supervised manner, i.e. the integrated data modalities must be associated with labelled classes (e.g. disease and control class; pathway 1 and pathway 2 etc.). Such constraints impose a limitation on the algorithms to be used in varieties of applications (e.g., can’t be used in clustering). In contrast, Ha, et al. (4) developed an unsupervised Gaussian process model for qualitative (logical combination of binary values, e.g. 0/1 or TRUE/FALSE) integration of a large number of epigenetic factors and applied it to describe cell lineage-specific gene regulatory programs. Although this overcomes the limitation of class association to some extent, this process is only sufficient for the applications where qualitative integration is sufficient.
As mentioned earlier, there is another important application of data integration, which is associated with the disentanglement of the source of variation observed between the samples or the individuals. A common strategy used in such studies is to perform a large number of association tests between the features and the data modalities (5, 6). Alternatively, kernel or graph-based methods are also considered as methods of integration and the integrated data is then used to build common similarity networks between samples (7, 8). Although effective in accomplishing specific tasks, such approaches lack proper interpretability (9) and do not suffice to explain many other relevant queries applicable to integrated omics data. Such queries include identification of variably controlled genes, genes sharing similar epigenomic state across various samples and many more.
OMICsPCA is a multipurpose tool designed to overcome the difficulties associated with both the applications of data integration. It is designed to identify the source of variation among the samples/variables (the columns of a table, e.g. cell lines, patients etc.) or individuals (the rows of a table, e.g. genes, exons etc.) associated with each data modality (i.e. the tables, e.g. CpG_methylation, ChIP-seq on a protein etc.). It guides the user through various analyses to decide on similar samples/variables (e.g.Cell lines). This is followed by selecting the data modalities (e.g. the assays) that can be included or excluded leading to data integration. The selected data modalities can be further integrated and analysed and the data points or individuals (e.g. genes, TSSs, exons) may be clustered on the basis of the integrated data coming from single or multiple modalities.
In short, OMICsPCA can be used in clustering genes, TSSs, exons on the basis of multiple types of heterogeneous experimental or theoretical data or vice-versa. Such kind of multivariate analyses is useful in identification of similar cell lines or assays prior to the integration process. In this vignette, we show an example of the entire analysis process done on human transcription start sites.
In the following example, we run our analyses on a subset of 3 epigenetic Assays and one expression assay. All the data are publicly available at ENCODE and ROADMAP epigenomics project portal. The data is stored in the MultiAssayExperiment object named multi_assay, which is stored in the supporting data package OMICsPCAdata.
More detailed description of the datasets are provided at the man page of multi_assay, which can be accessed by ?multi_assay
In this section we demonstrate the functionality of OMICsPCA with an example. We divided the GENCODE annotated TSSs into 4 groups on the basis of their on-off state in 31 cell lines and mapped Histone ChIP-seq data with them in order to study the consolidated effect of these data modalities on their expression. Before calculating the consolidated effect, we filtered out the outlier samples and identified the data modalities that show discriminatory density pattern among the expression groups. Finally, we clustered the similar TSSs together on the basis of the consolidated effect of discriminatory factors.
OMICsPCA has an inbuilt function prepare_dataset() that aggregates the samples (e.g. cell lines) of each data modality (e.g. a histone modification ChIP-seq assay) corresponding to each feature of the genome (notably gene, TSS, exon etc.) and returns a dataframe.
The rows of this dataframe object contain the features (e.g. genes) and each column contains the values corresponding to a sample (e.g. Cell line).
The arguments taken by this function is described below:
factdir
: full path of the directory containing files corresponding to various cell lines in .bed format. This directory should contain only the files of the same assay (modality) (e.g. H3k9ac) from different samples (e.g. cell lines).
annofile
: full path of the file containing the annotation file (e.g exons, TSSs, genes etc) in .bed format.
annolist
: name of full set or subset of the entries in the annotation file. The .bed files should have at least 4 columns. The first column is the chromosome name, the second and third is the start and end position of the feature (e.g. a gene or a ChIP-seq peak etc). The fourth column contains the value of the described feature (e.g. expression of a gene, the ChIP-seq intensity of a peak or name of a gene etc.).
The .bed
files should have at least 4 columns. The first column is the chromosome name, the second and third is the start and end of the feature (e.g. a gene or a ChIP-seq peak etc) accordingly and the fourth column contains the value (e.g. expression of a gene, ChIP-seq intensity of a peak or name of a gene etc.).
Some examples of the input file formats are given below:
chromosome | start | end | intensity |
---|---|---|---|
chr1 | 29533 | 29590 | 3.665498 |
chr1 | 53000 | 53010 | 3.798148 |
chr1 | 160430 | 160467 | 4.294974 |
chr1 | 893924 | 895151 | 3.160494 |
chr1 | 895275 | 896888 | 3.881008 |
chromosome | start | end | TSS ID |
---|---|---|---|
chr1 | 11858 | 11885 | TSS1 |
chr1 | 11999 | 12021 | TSS2 |
chr1 | 29543 | 29565 | TSS3 |
chr1 | 30256 | 30278 | TSS4 |
chr1 | 53038 | 53060 | TSS5 |
TSS ID |
---|
TSS1 |
TSS2 |
TSS3 |
TSS4 |
TSS5 |
Here we show the use of prpareDataset() by integrating demo peaks (.bed format) from 2 files into a data frame named. The example data sets are packaged inside the OMICsPCAdata
package
anno <- system.file("extdata/annotation2/TSS_groups.bed",
package = "OMICsPCAdata")
list <- system.file("extdata/annotation2/TSS_list",
package = "OMICsPCAdata")
fact <- system.file("extdata/factors2/demofactor",
package = "OMICsPCAdata")
list.files(path = fact)
# [1] "Cell1.bed" "Cell2.bed"
prepare_dataset() will combine these 2 bed files into a dataframe. The rows of the dataframe will correspond to the entries in file entered through the variable named list
and the columns will correspond to the 2 files shown above.
all_cells <- prepare_dataset(factdir = fact,
annofile = anno, annolist = list)
# [1] "Running intersect... This may take some time"
# [1] "Merging cell lines... This may take some time"
# [1] "Total time taken is: 0.738696575164795"
# [1] "time taken to run intersect() is: 0.727060794830322"
# [1] "time taken to run merge_cells() is: 0.0116357803344727"
head(all_cells[c(1,14,15,16,20),])
# # A tibble: 5 x 2
# Cell1.bed Cell2.bed
# <dbl> <dbl>
# 1 0 0
# 2 3.67 0
# 3 0 1.86
# 4 0 0
# 5 4.29 0
For a different assay type repeat the process:
factor_x <- prepare_dataset(factdir = fact,
annofile = anno, annolist = list)
# where `fact` is the directory of the samples corresponding to factor_x
For example,
fact <- system.file("path to cpg")
cpg <- prepare_dataset(factdir = fact, annofile=anno, annolist = list)
OMICsPCA is designed to read various assays done on various cell lines into an MultiAssayExperiment class object and then run various analyses on this object.
library(MultiAssayExperiment)
datalist <- data(package = "OMICsPCAdata")
datanames <- datalist$results[,3]
data(list = datanames)
assaylist <- list("demofactor" = all_cells)
demoMultiAssay <- MultiAssayExperiment(experiments = assaylist)
# ExperimentList contains data.frame or DataFrame,
# potential for errors with mixed data types
For ease of analysis, we compiled some preloaded data into an MultiAssayExperiment object named multi_assay.
Once the dataframes corresponding to the assays/modalities is prepared and stored in the “MultiAssayExperiment” object“, the user may want to divide the entire annotation set into smaller groups. Here we show an example to divide 28770 GENCODE TSS groups into 4 expression groups. The grouping criteria are based on the on/off status (determined from CAGE experiments) of the TSSs in 31 cell lines.
OMICsPCA has a function create_group() to do this task. This function takes the following 4 arguments:
group_names : a vector containing the user-defined names of the groups to be created
factor : name of the data frame on which groups will be created
condition : a vector of conditions corresponding to comparison
. condition should be a vector or range of digits (e.g. c(1,3,7,9). or 1:5) if %in% is chosen as comparison and a single digit for other cases.
name : name of the MultiAssayExperiment object containing the assay data
grouping_factor : name of the assay/modality on which grouping will be done
groupinfo <- create_group(name = multi_assay, group_names = c("WE" ,
"RE" ,"NE" ,"IntE"),
grouping_factor = "CAGE",
comparison = c(">=" ,"%in%" ,"==" ,"%in%"),
condition = c("25" ,"1:5" ,"0" ,"6:24"))
In the above example the dataframe CAGE is divided into 4 smaller data frames e.g; WE, RE, NE and IntE. “WE” contains TSSs expressed in >= 25 cell lines, “RE” represents TSSs expressed in 1 to 5 cell lines and so on.
The CAGE data frame looks like this:
A549 | AG04450 | B.Cells_CD20. | BJ | Gm12878 | |
---|---|---|---|---|---|
ENST00000387347.2 | 0 | 0 | 85.8895806558477 | 0 | 0 |
ENST00000387314.1 | 49.6336910685519 | 0 | 0 | 0 | 0 |
ENST00000389680.2 | 0 | 0 | 0 | 0 | 0 |
ENST00000386347.1 | 61.4568108749758 | 19.4710468352727 | 23.969601604456 | 17.9526438211874 | 129.285323659344 |
ENST00000361390.2 | 196.720325095994 | 121.562481593189 | 216.782580604901 | 60.083445046716 | 74.770012182987 |
The groupinfo object is accessd by subsequent functions. So this object has to be present in the global environment to run other functions. If the user do not wish to divide the datasets into groups or if the user has predefined groups, then this information may be read directly from a file or object having this information. We compiled a predefined group format in a dataframe groupinfo_ext and compiled it in OMICsPCAdata.
group | |
---|---|
ENST00000361390.2 | WE |
ENST00000361335.1 | WE |
ENST00000361567.2 | WE |
ENST00000293860.5 | WE |
ENST00000400266.3 | WE |
Once the group information is read, we can study the the distribution of various Assays/factors(e.g. H2az, H3k9ac etc) on the annotations (e.g. TSSs, genes) of various groups (In the above example, the groups are created on the basis of on-off state of TSSs in 31 cell lines). OMICsPCA has several functions for EDA.
Descriptor works in two modes depending on the value passed through the argument choice
. choice = 1 displays panels of boxplots corresponding to the Assay(s) or factor(s) or modality(ies) chosen through argument factor
. Each panel shows a number of boxplots corresponding to the group(s) passed through argument groups
. Each boxplot in choice 1 represents the distribution of the percentage of cell lines corresponding to a group that is overlapped with the selected assay.
NOTE : in groups
, one group name should be used only once.
NOTE : groups
can handle only the predefined groups or the groups created by “create_group()”
descriptor(name = multi_assay,
factors = c(
"H2az",
"H3k4me1","H3k9ac"),
groups = c("WE","RE"),
choice = 1,
title = "Distribution of percentage of cell types overlapping
with various factors",
groupinfo = groupinfo)
Above boxplot shows that “H2az” and “H3k9ac” overlaps with Widely Expressed (“WE”) TSSs in almost 100 % cell lines of respective assays. Thus they can be thought of as characteristics of “WE”. In addition, almost the opposite trait is observed in “NE”, which in turn, indicates that these 6 properties may be the discriminatory factors between the various expression groups.
In choice
= 2, a value needs to be supplied to the argument choice2group
. This value should be the name of one of the groups created by “create_group()”. The plot displayed in this mode represents the distribution of the percentage of celll lines corresponding to a factor/Assay/modality with various combination of other Assays/factors/modalities supplied through factors
.
descriptor(name = multi_assay,
factors = c("H2az",
"H3k4me1","H3k9ac"),
groups = c("WE","RE"),
choice = 2,
choice2group = "WE",
title = "Distribution of percentage of cell types overlapping with
a factor in various combinations of epigenetic marks in the
selected group",
groupinfo = groupinfo)
The above set of boxplots indicates that most of the epigenetic factors in Widely expressed TSSs (“WE”) are present with other factors. This gives us an indication of the interplay between various epigenetic marks
This function creates pairwise correlation and scatter plots and histogram on selected groups. The type of results should be passed through the argument choice
. This is a wrapper function on various functions and thus can take additional and non-conflicting arguments specific to them.
choice | functions | output |
---|---|---|
table | cor {stats} | correlation table |
scatter | pairs {graphics} | scatterplot of each pair |
hist | ggplot,geom_histogram,facet_wrap{ggplot2} | histogram of each column |
all | chart.Correlation {PerformanceAnalytics} | all of the above together |