ExpressAnalystR is the underlying R package synchronized with ExpressAnalyst web server. It is designed for statistical analysis, enrichment analysis and visual analytics of single and multiple gene expression data, both matrix and gene list. The R package is composed of R functions necessary for the web-server to perform data annotation, normalization, differential expression and meta-analysis.
Following installation and loading of ExpressAnalystR, users will be able to reproduce web server results from their local computers using the R command history downloaded from ExpressAnalystR. Running the R functions will allow more flexibility and reproducibility.
Note - ExpressAnalystR is still under development - we cannot guarantee full functionality
To use ExpressAnalystR, make sure your R version is >4.0.3 and install all package dependencies. Ensure that you are able to download packages from Bioconductor. To install package dependencies, use the pacman R package. Note that some of these packages may require additional library dependencies that need to be installed prior to their own successful installation.
install.packages("pacman")
library(pacman)
pacman::p_load(igraph, RColorBrewer, qs, rjson, RSQLite)
ExpressAnalystR is freely available from GitHub. The package documentation, including the vignettes for each module and user manual is available within the downloaded R package file. If all package dependencies were installed, you will be able to install the ExpressAnalystR.
Install the package directly from github using the devtools package. Open R and enter:
# Step 1: Install devtools
install.packages('devtools')
library(devtools)
# Step 2: Install ExpressAnalystR WITHOUT documentation
devtools::install_github("xia-lab/ExpressAnalystR", build = TRUE, build_opts = c("--no-resave-data", "--no-manual", "--no-build-vignettes"))
# Step 2: Install ExpressAnalystR WITH documentation
devtools::install_github("xia-lab/ExpressAnalystR", build = TRUE, build_opts = c("--no-resave-data", "--no-manual"), build_vignettes = TRUE)
Init.Data
function, which initiates R objects that stores
user’s data, parameters for further processing and analysis.?ExpressAnalystR::ReadTabExpression
to find out more about
this function.dataSet <- readDataset(fileName)
Before you start, please download the example
dataset. It is a microarray gene expression data of a human
breast-cancer cell line.
#### 1.1 Load ExpressAnalystR library
and initialize R objects
library(ExpressAnalystR)
#boolean FALSE indicates local mode (vs web mode);
Init.Data(FALSE);
# Set analysis type to single gene expression matrix
SetAnalType("onedata");
dataSets <- ReadTabExpressData("estrogen.txt");
For this step it is important to please select correct organism, data type (array or RNA-seq), id type and gene-level summary (mean, median, sum). For gene-level summary, microarray can use mean or median while RNA-seq needs to be sum.
It is also possible to keep id type "unspecified" if it is not supported. In this case, differential expression analysis can still be performed but functional analysis will not be available.
dataSets <- PerformDataAnnot("estrogen.txt", "hsa", "array", "hgu95av2", "mean");
Take a look at the mapped dataset by reading the dataset’s qs file
using readDataset(fileName)
function.
dataSet <- readDataset("estrogen.txt")
print(head(dataSet$data.anot[c(1:6),]))
low10-1.cel low10-2.cel high10-1.cel high10-2.cel low48-1.cel low48-2.cel
5875 9.216562 9.290259 9.064545 8.958936 9.222260 9.222435
5595 10.398169 10.254362 10.003971 9.903528 10.374866 10.033520
7075 5.717613 5.881008 5.859563 5.954028 5.960540 6.020889
1557 6.595252 6.828249 6.625206 6.664867 6.562788 6.642651
643 7.581658 7.771235 7.672983 7.813411 7.839115 7.980552
Check diagnostics plots to look at overall data distribution, sample separation.
PlotDataBox("estrogen.txt", "qc_boxplot_", 72, "png");
PlotDataPCA("estrogen.txt", "qc_pca_", 72, "png");
Check your working directory for png images named
qc_boxplot_dpi72.png
and qc_pca_dpi72.png
,
open them.
Box plot shows that the expression distribution
of samples are between around -4 to 12.5. This shows that the data has
already been normalized.
PCA plot shows sample separation both between absent
and present, and also, low and high. Depending of your experimental
design, try to see if the samples are separated by the metadata of
interest, it can also be used to see whether there are potentially
mislabed sample.
No normalization need to be performed, PCA plot from previous step shows that the dataset is already normalized.
dataSets <- PerformNormalization("estrogen.txt", "none", 15, 5);
Selected metadata of interest, in this case we are interested in
investigating the presence of Estrogen Receptor (ER) vs
absence. After selecting metadata, the next step is to compute design matrix and select DE analysis
algorithm by running SetupDesignMatrix
function. For
microarray data, only limma
can be used.
dataSets <- SetSelectedMetaInfo("estrogen.txt","ER", "NA", F);
dataSets <- SetupDesignMatrix("estrogen.txt", "limma");
Fold change is log2 transformed. Adjusted P-value using False Discovery Rate (FDR) method.
dataSets <- PerformDEAnal("estrogen.txt", "custom", "absent vs. present", "NA", "intonly");
dataSet <- readDataset("estrogen.txt");
print(head(dataSet$comp.res));
logFC AveExpr t P.Value adj.P.Val B
5111 -2.235524 9.136833 -15.01586 3.213962e-08 0.0001625112 9.203192
7083 -2.898336 9.850896 -13.95419 6.517279e-08 0.0001625112 8.612374
4605 -2.924256 8.532099 -13.66086 7.991986e-08 0.0001625112 8.438401
7031 -3.198764 12.115778 -13.08278 1.208922e-07 0.0001625112 8.080844
2717 -1.581543 8.709900 -12.79062 1.499605e-07 0.0001625112 7.892350
3399 1.496948 11.529434 12.76829 1.524784e-07 0.0001625112 7.877719
PlotSelectedGene("estrogen.txt","5111");
Check the resulting png image (Gene_5111.png) in your working
directory.
Before you start, please download the example datasets into your working directory E-GEOD-25713, E-GEOD-59276.txt, GSE69588.txt. These three testing datasets (containing subset of 5000 genes) are from a meta-analysis of helminth infections in mouse liver.
library(ExpressAnalystR)
#boolean FALSE indicates local mode (vs web mode);
Init.Data(FALSE);
# Set analysis type to meta-analysis
SetAnalType("metadata");
#Read dataset text file
dataSets <- ReadOmicsData("E-GEOD-25713.txt");
dataSets <- SanityCheckData("E-GEOD-25713.txt");
#Map gene id to entrez id
dataSets <- AnnotateGeneData("E-GEOD-25713.txt", "mmu", "entrez");
Visually inspect dataset using box plot
(qc_boxplot_0_dpi72.png
) and pca plot
(qc_pca_0_dpi72.png
).
PlotDataProfile("E-GEOD-25713.txt", "raw", "qc_boxplot_0_", "qc_pca_0_");
#Remove variables with more than 50% missing data
dataSets <- RemoveMissingPercent("E-GEOD-25713.txt", 0.5);
#Replace missing value with minimum values across dataset
dataSets <- ImputeMissingVar("E-GEOD-25713.txt", "min");
#Replace missing value with minimum values across dataset
dataSets <- NormalizeDataMetaMode("E-GEOD-25713.txt", "NA");
dataSets <- PerformDEAnalMeta("E-GEOD-25713.txt", "limma", "CLASS", 0.05, 0.0);
#read and process the other two datasets
dataSets <- ReadOmicsData("E-GEOD-59276.txt");
dataSets <- SanityCheckData("E-GEOD-59276.txt");
dataSets <- AnnotateGeneData("E-GEOD-59276.txt", "mmu", "entrez");
dataSets <- RemoveMissingPercent("E-GEOD-59276.txt", 0.5)
dataSets <- ImputeMissingVar("E-GEOD-59276.txt", "min")
dataSets <- NormalizeDataMetaMode("E-GEOD-59276.txt", "NA");
dataSets <- PerformDEAnalMeta("E-GEOD-59276.txt", "limma", "CLASS", 0.05, 0.0);
dataSets <- ReadOmicsData("GSE69588.txt");
dataSets <- SanityCheckData("GSE69588.txt");
dataSets <- AnnotateGeneData("GSE69588.txt", "mmu", "entrez");
dataSets <- RemoveMissingPercent("GSE69588.txt", 0.5)
dataSets <- ImputeMissingVar("GSE69588.txt", "min")
dataSets <- NormalizeDataMetaMode("GSE69588.txt", "NA");
dataSets <- PerformDEAnalMeta("GSE69588.txt", "limma", "CLASS", 0.05, 0.0);
CheckMetaDataIntegrity();
PlotMetaPCA("qc_meta_pca_","72", "png", "");
There is clear signs of batch effect. The
samples from same dataset are largely clustered together. To remove the
batch effect, we need to run comBat batch correction algorithm
#Apply batch effect correction
PerformBatchCorrection();
#Check the result
PlotMetaPCA("qc_meta_pca_afterBatch_","72", "png", "");
Here is the result after batch correction.
analSet <- PerformPvalCombination("fisher", 0.05)
analSet <- readSet(analSet, "analSet");
print(head(analSet$meta.mat));
CombinedTstat CombinedPval
16854 89.093 2.7728e-14
246256 99.964 2.7728e-14
105855 94.751 2.7728e-14
19241 105.030 2.7728e-14
319169 94.339 2.7728e-14
16819 100.880 2.7728e-14
For a more detailed table containing additionally log fold change and p-values of features for individual dataset, please check this csv file meta_sig_genes_metap.csv, it is also generated in your working directory.
Xia Lab @ McGill  (last updated 2023-04-21) |
You will be logged off in seconds.
Do you want to continue your session? |