Description

ExpressAnalystR is the underlying R package synchronized with ExpressAnalyst web server. It is designed for statistical analysis, enrichment analysis and visual analytics of single and multiple gene expression data, both matrix and gene list. The R package is composed of R functions necessary for the web-server to perform data annotation, normalization, differential expression and meta-analysis.

Following installation and loading of ExpressAnalystR, users will be able to reproduce web server results from their local computers using the R command history downloaded from ExpressAnalystR. Running the R functions will allow more flexibility and reproducibility.

Note - ExpressAnalystR is still under development - we cannot guarantee full functionality

Installation

1. Install package dependencies

To use ExpressAnalystR, make sure your R version is >4.0.3 and install all package dependencies. Ensure that you are able to download packages from Bioconductor. To install package dependencies, use the pacman R package. Note that some of these packages may require additional library dependencies that need to be installed prior to their own successful installation.

install.packages("pacman")

library(pacman)

pacman::p_load(igraph, RColorBrewer, qs, rjson, RSQLite)

2. Install the package

ExpressAnalystR is freely available from GitHub. The package documentation, including the vignettes for each module and user manual is available within the downloaded R package file. If all package dependencies were installed, you will be able to install the ExpressAnalystR.

Install the package directly from github using the devtools package. Open R and enter:

# Step 1: Install devtools
install.packages('devtools')
library(devtools)

# Step 2: Install ExpressAnalystR WITHOUT documentation
devtools::install_github("xia-lab/ExpressAnalystR", build = TRUE, build_opts = c("--no-resave-data", "--no-manual", "--no-build-vignettes"))

# Step 2: Install ExpressAnalystR WITH documentation
devtools::install_github("xia-lab/ExpressAnalystR", build = TRUE, build_opts = c("--no-resave-data", "--no-manual"), build_vignettes = TRUE)

Tips for using the ExpressAnalystR package

  1. The first function that you will use in every module is the Init.Data function, which initiates R objects that stores user’s data, parameters for further processing and analysis.
  2. The ExpressAnalystR package will output data files/tables/analysis/networks outputs in your current working directory.
  3. Every function must be executed in sequence as it is shown on the R Command history, please do not skip any commands as this can result in errors downstream.
  4. Main functions in ExpressAnalystR are documented. Use the ?Function format to open its documentation. For instance, use ?ExpressAnalystR::ReadTabExpression to find out more about this function.
  5. It is recommended to set the working folder to an empty folder because numerous files will be generated through the process.
  6. R package is not useful for visual analytics as they are hosted on the website. It’s mainly useful for statistical analysis (differential expression and meta-analysis).
  7. "dataSets" is a list containing each individual dataset, used for data annotation, filtering, normalization, etc. You can also use this function to access: dataSet <- readDataset(fileName)
  8. Other useful R objects are:
    1. paramSet: contains various parameters relevant for analysis
    2. analSet: contains various statistical analysis results
    3. msgSet: contains info/error messages
    4. cmdSet: contains R command history, also found in Rhistory.R

Examples

1. Starting from a gene expression matrix

Before you start, please download the example dataset. It is a microarray gene expression data of a human breast-cancer cell line.
#### 1.1 Load ExpressAnalystR library and initialize R objects

library(ExpressAnalystR)

#boolean FALSE indicates local mode (vs web mode);
Init.Data(FALSE);

# Set analysis type to single gene expression matrix
SetAnalType("onedata");

1.2 Read data table

dataSets <- ReadTabExpressData("estrogen.txt");

1.3 Annotate gene IDs to Entrez

For this step it is important to please select correct organism, data type (array or RNA-seq), id type and gene-level summary (mean, median, sum). For gene-level summary, microarray can use mean or median while RNA-seq needs to be sum.

It is also possible to keep id type "unspecified" if it is not supported. In this case, differential expression analysis can still be performed but functional analysis will not be available.

dataSets <- PerformDataAnnot("estrogen.txt", "hsa", "array", "hgu95av2", "mean");

Take a look at the mapped dataset by reading the dataset’s qs file using readDataset(fileName) function.

dataSet <- readDataset("estrogen.txt")
print(head(dataSet$data.anot[c(1:6),]))
     low10-1.cel low10-2.cel high10-1.cel high10-2.cel low48-1.cel low48-2.cel
5875    9.216562    9.290259     9.064545     8.958936    9.222260    9.222435
5595   10.398169   10.254362    10.003971     9.903528   10.374866   10.033520
7075    5.717613    5.881008     5.859563     5.954028    5.960540    6.020889
1557    6.595252    6.828249     6.625206     6.664867    6.562788    6.642651
643     7.581658    7.771235     7.672983     7.813411    7.839115    7.980552

Check diagnostics plots to look at overall data distribution, sample separation.

PlotDataBox("estrogen.txt", "qc_boxplot_", 72, "png");
PlotDataPCA("estrogen.txt", "qc_pca_", 72, "png");

Check your working directory for png images named qc_boxplot_dpi72.png and qc_pca_dpi72.png, open them.
Box Plot
Box plot shows that the expression distribution of samples are between around -4 to 12.5. This shows that the data has already been normalized.
PCA Plot
PCA plot shows sample separation both between absent and present, and also, low and high. Depending of your experimental design, try to see if the samples are separated by the metadata of interest, it can also be used to see whether there are potentially mislabed sample.

1.4 Perform data filtering and normalization

No normalization need to be performed, PCA plot from previous step shows that the dataset is already normalized.


dataSets <- PerformExpressNormalization("estrogen.txt", "none", 15, 5);

1.5 Prepare differential expression (DE) analysis

Selected metadata of interest, in this case we are interested in investigating the presence of Estrogen Receptor (ER) vs absence. After selecting metadata, the next step is to compute design matrix and select DE analysis algorithm by running SetupDesignMatrix function. For microarray data, only limma can be used.

dataSets <- SetSelectedMetaInfo("estrogen.txt","ER", "NA", F);
dataSets <- SetupDesignMatrix("estrogen.txt", "limma");

1.6 Perform DE analysis and check DE results

Fold change is log2 transformed. Adjusted P-value using False Discovery Rate (FDR) method.

dataSets <- PerformDEAnal("estrogen.txt", "custom", "absent vs. present", "NA", "intonly");
dataSet <- readDataset("estrogen.txt");
print(head(dataSet$comp.res));
         logFC   AveExpr         t      P.Value    adj.P.Val        B
5111 -2.235524  9.136833 -15.01586 3.213962e-08 0.0001625112 9.203192
7083 -2.898336  9.850896 -13.95419 6.517279e-08 0.0001625112 8.612374
4605 -2.924256  8.532099 -13.66086 7.991986e-08 0.0001625112 8.438401
7031 -3.198764 12.115778 -13.08278 1.208922e-07 0.0001625112 8.080844
2717 -1.581543  8.709900 -12.79062 1.499605e-07 0.0001625112 7.892350
3399  1.496948 11.529434  12.76829 1.524784e-07 0.0001625112 7.877719

1.7 Visualize gene expression pattern of individual gene

PlotSelectedGene("estrogen.txt","5111");

Check the resulting png image (Gene_5111.png) in your working directory.
Violin Plot

2. Starting from three datasets for meta-analysis

Before you start, please download the example datasets into your working directory E-GEOD-25713, E-GEOD-59276.txt, GSE69588.txt. These three testing datasets (containing subset of 5000 genes) are from a meta-analysis of helminth infections in mouse liver.

2.1 Load ExpressAnalystR library and initialize R objects

library(ExpressAnalystR)

#boolean FALSE indicates local mode (vs web mode);
Init.Data(FALSE);

# Set analysis type to meta-analysis
SetAnalType("metadata");

2.2 Process each individual dataset

#Read dataset text file
dataSets <- ReadOmicsData("E-GEOD-25713.txt");
dataSets <- SanityCheckData("E-GEOD-25713.txt");

#Map gene id to entrez id
dataSets <- AnnotateGeneData("E-GEOD-25713.txt", "mmu", "entrez");

Visually inspect dataset using box plot (qc_boxplot_0_dpi72.png) and pca plot (qc_pca_0_dpi72.png).

PlotDataProfile("E-GEOD-25713.txt", "raw", "qc_boxplot_0_", "qc_pca_0_");

Box Plot PCA Plot

#Remove variables with more than 50% missing data
dataSets <- RemoveMissingPercent("E-GEOD-25713.txt", 0.5);

#Replace missing value with minimum values across dataset
dataSets <- ImputeMissingVar("E-GEOD-25713.txt", "min");

#Replace missing value with minimum values across dataset
dataSets <- NormalizingDataMeta("E-GEOD-25713.txt", "NA");
dataSets <- PerformDEAnalMeta("E-GEOD-25713.txt", "limma", "CLASS", 0.05, 0.0);

#read and process the other two datasets
dataSets <- ReadOmicsData("E-GEOD-59276.txt");
dataSets <- SanityCheckData("E-GEOD-59276.txt");
dataSets <- AnnotateGeneData("E-GEOD-59276.txt", "mmu", "entrez");
dataSets <- RemoveMissingPercent("E-GEOD-59276.txt", 0.5)
dataSets <- ImputeMissingVar("E-GEOD-59276.txt", "min")
dataSets <- NormalizingDataMeta("E-GEOD-59276.txt", "NA");
dataSets <- PerformDEAnalMeta("E-GEOD-59276.txt", "limma", "CLASS", 0.05, 0.0);

dataSets <- ReadOmicsData("GSE69588.txt");
dataSets <- SanityCheckData("GSE69588.txt");
dataSets <- AnnotateGeneData("GSE69588.txt", "mmu", "entrez");
dataSets <- RemoveMissingPercent("GSE69588.txt", 0.5)
dataSets <- ImputeMissingVar("GSE69588.txt", "min")
dataSets <- NormalizingDataMeta("GSE69588.txt", "NA");
dataSets <- PerformDEAnalMeta("GSE69588.txt", "limma", "CLASS", 0.05, 0.0);

2.3 Perform data integrity check (compatibility)

CheckMetaDataIntegrity();

2.4 Check diagnostic plot and perform batch correction

PlotMetaPCA("qc_meta_pca_","72", "png", "");

PCA Plot
There is clear signs of batch effect. The samples from same dataset are largely clustered together. To remove the batch effect, we need to run comBat batch correction algorithm

#Apply batch effect correction
PerformBatchCorrection();

#Check the result 
PlotMetaPCA("qc_meta_pca_afterBatch_","72", "png", "");

Here is the result after batch correction.
PCA Plot

2.5 Perform statistical meta-analysis using combine p-values method

analSet <- PerformPvalCombination("fisher", 0.05)

2.6 View result tables

analSet <- readSet(analSet, "analSet");
print(head(analSet$meta.mat));
       CombinedTstat CombinedPval
16854         89.093   2.7728e-14
246256        99.964   2.7728e-14
105855        94.751   2.7728e-14
19241        105.030   2.7728e-14
319169        94.339   2.7728e-14
16819        100.880   2.7728e-14

For a more detailed table containing additionally log fold change and p-values of features for individual dataset, please check this csv file meta_sig_genes_metap.csv, it is also generated in your working directory.

Processing ....
Your session is about to expire!

You will be logged off in seconds.

Do you want to continue your session?