ExpressAnalystR is the underlying R package synchronized with ExpressAnalyst web server. It is designed for statistical analysis, enrichment analysis and visual analytics of single and multiple gene expression data, both matrix and gene list. The R package is composed of R functions necessary for the web-server to perform data annotation, normalization, differential expression and meta-analysis.
Following installation and loading of ExpressAnalystR, users will be able to reproduce web server results from their local computers using the R command history downloaded from ExpressAnalystR. Running the R functions will allow more flexibility and reproducibility.
Note - ExpressAnalystR is still under development - we cannot guarantee full functionality
To use ExpressAnalystR, make sure your R version is >4.0.3 and install all package dependencies. Ensure that you are able to download packages from Bioconductor. To install package dependencies, use the pacman R package. Note that some of these packages may require additional library dependencies that need to be installed prior to their own successful installation.
install.packages("pacman")
library(pacman)
pacman::p_load(igraph, RColorBrewer, qs, rjson, RSQLite)
ExpressAnalystR is freely available from GitHub. The package documentation, including the vignettes for each module and user manual is available within the downloaded R package file. If all package dependencies were installed, you will be able to install the ExpressAnalystR.
Install the package directly from github using the devtools package. Open R and enter:
# Step 1: Install devtools
install.packages('devtools')
library(devtools)
# Step 2: Install ExpressAnalystR WITHOUT documentation
devtools::install_github("xia-lab/ExpressAnalystR", build = TRUE, build_opts = c("--no-resave-data", "--no-manual", "--no-build-vignettes"))
# Step 2: Install ExpressAnalystR WITH documentation
devtools::install_github("xia-lab/ExpressAnalystR", build = TRUE, build_opts = c("--no-resave-data", "--no-manual"), build_vignettes = TRUE)
Init.Data
function, which initiates R objects that stores
user’s data, parameters for further processing and analysis.?ExpressAnalystR::ReadTabExpression
to find out more about
this function.dataSet <- readDataset(fileName)
Before you start, please download the example
dataset. It is a microarray gene expression data of a human
breast-cancer cell line.
#### 1.1 Load ExpressAnalystR library
and initialize R objects
library(ExpressAnalystR)
#boolean FALSE indicates local mode (vs web mode);
Init.Data(FALSE);
# Set analysis type to single gene expression matrix
SetAnalType("onedata");
dataName <- "endotoxin_data.txt";
dataSets <- ReadTabExpressData(dataName);
For this step it is imortant to please select correct organism, data type (array or RNA-seq), id type and gene-level summary (mean, median, sum). For gene-level summary, microarray can use mean or median while RNA-seq needs to be sum.
dataSets <- PerformDataAnnot(dataName, "hsa", "array", "refseq", "mean");
Take a look at the mapped dataset by reading the dataset’s qs file
using readDataset(fileName)
function.
dataSet <- readDataset(dataName)
print(head(dataSet$data.anot[c(1:5),]))
Check diagnostics plots to look at overall data distribution, sample separation.
PlotDataBox(dataName, "qc_boxplot_", 72, "png");
PlotDataPCA(dataName, "qc_pca_", 72, "png");
Check your working directory for png images named
qc_boxplot_dpi72.png
and qc_pca_dpi72.png
,
open them.
Box plot shows that the expression distribution
of samples are between around -4 to 12.5. This shows that the data has
already been normalized.
PCA plot shows sample separation both between absent
and present, and also, low and high. Depending of your experimental
design, try to see if the samples are separated by the metadata of
interest, it can also be used to see whether there are potentially
mislabed sample.
No normalization need to be performed, PCA plot from previous step shows that the dataset is already normalized. Filter by variance (lower 15% removed) Filter by relative abundance (lower 4 percentile of average expression signal) Filter by count not applied (only for RNASeq data) Filter unannotated genes TRUE
dataSets <- PerformNormalization("dataName", "none", 15, 4,"true","false");
Selected metadata of interest, in this case we are interested in
investigating the effect of presence of Estrogen Receptor (ER) vs
absence. We are not setting secondary factor and blocking factor. After
selecting metadata, compute design matrix and select DE analysis
algorithm by running SetupDesignMatrix
function. For
microarray data, only limma
can be used.
dataSets <- SetSelectedMetaInfo(dataName,"Treatment", "NA", F);
dataSets <- SetupDesignMatrix(dataName, "limma");
Fold change is log2 transformed. Adjusted P-value using False Discovery Rate (FDR) method.
dataSets <- PerformDEAnal(dataName, "custom", "control vs. LPS", "NA", "intonly");
dataSet <- readDataset(dataName);
print(head(dataSet$comp.res));
PlotSelectedGene(dataName,"Gene_5111_","5111", dpi=72, format="png");
Check the resulting png image (Gene_5111_dpi72.png) in your working
directory.
Before you start, please download the example datasets into your working directory E-GEOD-25713, E-GEOD-59276.txt, GSE69588.txt. These three testing datasets (containing subset of 5000 genes) are from a meta-analysis of helminth infections in mouse liver.
library(ExpressAnalystR)
#boolean FALSE indicates local mode (vs web mode);
Init.Data(FALSE);
# Set analysis type to meta-analysis
SetAnalType("metadata");
#Read dataset text file
dataSets <- ReadOmicsData("E-GEOD-25713.txt");
dataSets <- ReadOmicsData("E-GEOD-59276.txt");
dataSets <- ReadOmicsData("GSE69588.txt");
ReadMetaData("helminth_meta.csv");
dataSets <- SanityCheckData("E-GEOD-25713.txt");
#Map gene id to entrez id
dataSets <- AnnotateGeneData("E-GEOD-25713.txt", "mmu", "mean","entrez");
Visually inspect dataset using box plot
(qc_boxplot_0_dpi72.png
) and pca plot
(qc_pca_0_dpi72.png
).
PlotDataProfile("E-GEOD-25713.txt", "raw", "qc_boxplot_0_", "qc_pca_0_");
#Remove variables with more than 50% missing data
dataSets <- RemoveMissingPercent("E-GEOD-25713.txt", 0.5);
#Replace missing value with minimum values across dataset
dataSets <- ImputeMissingVar("E-GEOD-25713.txt", "min");
#Replace missing value with minimum values across dataset
dataSets <- FilteringData("E-GEOD-25713.txt","pct","0", "0");
dataSets <- NormalizeDataMetaMode("E-GEOD-25713.txt", "NA");
#process the other two datasets
dataSets <- SanityCheckData("E-GEOD-59276.txt");
dataSets <- AnnotateGeneData("E-GEOD-59276.txt", "mmu","mean", "entrez");
dataSets <- RemoveMissingPercent("E-GEOD-59276.txt", 0.5);
dataSets <- ImputeMissingVar("E-GEOD-59276.txt", "min");
dataSets <- FilteringData("E-GEOD-59276.txt","pct","0", "0");
dataSets <- NormalizeDataMetaMode("E-GEOD-59276.txt", "NA");
dataSets <- SanityCheckData("GSE69588.txt");
dataSets <- AnnotateGeneData("GSE69588.txt", "mmu","mean", "entrez");
dataSets <- RemoveMissingPercent("GSE69588.txt", 0.5);
dataSets <- ImputeMissingVar("GSE69588.txt", "min");
dataSets <- FilteringData("GSE69588.txt","pct","0", "0");
dataSets <- NormalizeDataMetaMode("GSE69588.txt", "NA");
CheckMetaDataIntegrity();
PlotMetaPCA("qc_meta_pca_","72", "png", "");
There is clear signs of batch effect. The
samples from same dataset are largely clustered together. To remove the
batch effect, we need to run comBat batch correction algorithm
#Apply batch effect correction
dataSets <- PerformBatchCorrection();
#Check the result
PlotMetaPCA("qc_meta_pca_afterBatch_","72", "png", "");
Here is the result after batch correction.
#### 2.5 Perform DE analysis
CovariateScatter.Anal("E-GEOD-25713.txt", "E-GEOD-25713.txt_covariate_0_.png", "png", "Condition", "CONTROL", "NA" , 0.05,"fdr", "anova")
CovariateScatter.Anal("E-GEOD-59276.txt", "E-GEOD-59276.txt_covariate_0_.png", "png", "Condition", "CONTROL", "NA" , 0.05,"fdr", "anova")
CovariateScatter.Anal("GSE69588.txt", "GSE69588.txt_covariate_0_.png", "png", "Condition", "CONTROL", "NA" , 0.05,"fdr", "anova")
PerformPvalCombination("fisher", 0.05)
analSet <- readSet(analSet, "analSet");
print(head(analSet$meta.mat));
#
# CombinedTstat CombinedPval
#16854 89.093 2.7728e-14
#246256 99.964 2.7728e-14
#105855 94.751 2.7728e-14
#19241 105.030 2.7728e-14
#319169 94.339 2.7728e-14
#16819 100.880 2.7728e-14
For a more detailed table containing additionally log fold change and p-values of features for individual dataset, please check this csv file meta_sig_genes_metap.csv, it is also generated in your working directory.
Xia Lab @ McGill (last updated 2024-08-19) |
You will be logged off in seconds.
Do you want to continue your session? |