4 Advanced workflow with MultiAssayExperiment
In this chapter, we illustrate the epiregulon workflow starting from data in the form of SingleCellExperiment
objects using the Wilcoxon weight estimation method.
This is a dataset of hematopoiesis from the ArchR tutorial. Prior to using epiregulon, this dataset has been fully preprocessed in ArchR, and converted to a MultiAssayExperiment
using epireglon.archr::archr2MAE
. The MAE object was uploaded to scMultiome
for full reproducibility. In this dataset, scRNAseq and scATACseq were unpaired and integrated by the ArchR::addGeneIntegrationMatrix
function.
4.1 Data preparation
Download the example dataset from scMultiome package
mae <- scMultiome::hematopoiesis()
# Load peak matrix
PeakMatrix <- mae[["PeakMatrix"]]
# Load expression matrix
GeneExpressionMatrix <- mae[["GeneIntegrationMatrix"]]
# Add gene symbols to rownames
rownames(GeneExpressionMatrix) <- rowData(GeneExpressionMatrix)$name
# Transfer dimensionality reduction matrix to GeneExpression
reducedDim(GeneExpressionMatrix, "IterativeLSI") <-
reducedDim(mae[['TileMatrix500']], "IterativeLSI")
reducedDim(GeneExpressionMatrix, "UMAP") <-
reducedDim(mae[['TileMatrix500']], "UMAP")
Visualize the data
scater::plotReducedDim(GeneExpressionMatrix,
dimred = "UMAP",
text_by = "Clusters2",
colour_by = "Clusters2",
point_size = 0.3,
point_alpha = 0.3)
4.2 Retrieve bulk TF ChIP-seq binding sites
First, we retrieve the information of TF binding sites collected from Cistrome and ENCODE ChIP-seq. Currently, human genomes hg19 and hg38 and mouse genome mm10 are available
## see ?scMultiome and browseVignettes('scMultiome') for documentation
## loading from cache
## GRangesList object of length 1558:
## $AEBP2
## GRanges object with 2761 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr1 10001-10446 *
## [2] chr1 877485-877780 *
## [3] chr1 919866-920161 *
## [4] chr1 2985496-2985846 *
## [5] chr1 2985975-2986514 *
## ... ... ... ...
## [2757] chrY 8333302-8333771 *
## [2758] chrY 13842450-13842966 *
## [2759] chrY 13868154-13868670 *
## [2760] chrY 21464547-21465020 *
## [2761] chrY 22147548-22147868 *
## -------
## seqinfo: 25 sequences from an unspecified genome; no seqlengths
##
## ...
## <1557 more elements>
4.3 Link ATACseq peaks to target genes
Next, we compute peak to gene correlations using a custom algorithm that has similar performance to ArchR’s P2G function. Wherever possible, use a multidimensional dimensionality reduction matrix such as LSI or PCA instead of UMAP or TSNE since the former provides a more accurate estimate of cell similarity.
set.seed(1010)
p2g <- calculateP2G(peakMatrix = PeakMatrix,
expMatrix = GeneExpressionMatrix,
exp_assay = "normalizedCounts",
reducedDim = reducedDim(GeneExpressionMatrix, "IterativeLSI"))
## Using epiregulon to compute peak to gene links...
## performing k means clustering to form metacells
## Computing correlation
## DataFrame with 12952 rows and 8 columns
## idxATAC chr start end idxRNA target Correlation
## <integer> <character> <integer> <integer> <integer> <array> <matrix>
## 1 7 chr1 801002 801502 2 LINC00115 0.672342
## 2 8 chr1 805039 805539 6 KLHL17 0.570329
## 3 24 chr1 894453 894953 6 KLHL17 0.605406
## 4 37 chr1 935289 935789 8 HES4 0.553607
## 5 42 chr1 948574 949074 6 KLHL17 0.563872
## ... ... ... ... ... ... ... ...
## 12948 146390 chr22 50980758 50981258 12082 ODF3B 0.818101
## 12949 146390 chr22 50980758 50981258 12088 MAPK8IP2 0.606537
## 12950 146403 chr22 51021154 51021654 12078 LMF2 0.505536
## 12951 146403 chr22 51021154 51021654 12089 ARSA 0.587286
## 12952 146412 chr22 51110826 51111326 12090 SHANK3 0.589347
## distance
## <integer>
## 1 38099
## 2 90427
## 3 1013
## 4 0
## 5 52606
## ... ...
## 12948 9749
## 12949 57872
## 12950 75018
## 12951 44946
## 12952 1743
4.4 Add TF motif binding to peaks
The next step is to add the TF motif binding information by overlapping the regions of the peak matrix with the bulk chip-seq database.
## Computing overlap...
## Success!
## idxATAC idxTF tf
## 1013 7 13 ARNT
## 1014 7 17 ATF2
## 1015 7 21 ATF7
## 1016 7 28 BCL6
## 1017 7 30 BCOR
## 1018 7 32 BHLHE40
4.5 Generate regulons
A long format dataframe, representing the inferred regulons, is then generated. The dataframe consists of three columns:
- tf (transcription factor)
- target gene
- peak to gene correlation between tf and target gene
## DataFrame with 2040269 rows and 10 columns
## idxATAC chr start end idxRNA target
## <integer> <character> <integer> <integer> <integer> <character>
## 1 7 chr1 801002 801502 2 LINC00115
## 2 7 chr1 801002 801502 2 LINC00115
## 3 7 chr1 801002 801502 2 LINC00115
## 4 7 chr1 801002 801502 2 LINC00115
## 5 7 chr1 801002 801502 2 LINC00115
## ... ... ... ... ... ... ...
## 2040265 146412 chr22 51110826 51111326 12090 SHANK3
## 2040266 146412 chr22 51110826 51111326 12090 SHANK3
## 2040267 146412 chr22 51110826 51111326 12090 SHANK3
## 2040268 146412 chr22 51110826 51111326 12090 SHANK3
## 2040269 146412 chr22 51110826 51111326 12090 SHANK3
## distance idxTF tf corr
## <integer> <integer> <character> <matrix>
## 1 38099 13 ARNT 0.672342
## 2 38099 17 ATF2 0.672342
## 3 38099 21 ATF7 0.672342
## 4 38099 28 BCL6 0.672342
## 5 38099 30 BCOR 0.672342
## ... ... ... ... ...
## 2040265 1743 1172 POLR2AphosphoS5 0.589347
## 2040266 1743 1312 ZNF16 0.589347
## 2040267 1743 1461 ZNF600 0.589347
## 2040268 1743 1490 ZNF687 0.589347
## 2040269 1743 1520 ZNF777 0.589347
4.6 Prune network
Epiregulon prunes the network by performing tests of independence on the observed number of cells jointly expressing transcription factor (TF), regulatory element (RE) and target gene (TG) vs the expected number of cells if TF/RE and TG are independently expressed. We implement two tests, the binomial test and the chi-square test. In the binomial test, the expected probability is P(TF, RE) * P(TG), and the number of trials is the total number of cells, and the observed successes is the number of cells jointly expressing all three elements. In the chi-square test, the expected probability for having all 3 elements active is also P(TF, RE) * P(TG) and the probability otherwise is 1- P(TF, RE) * P(TG). The observed cell count for the active category is the number of cells jointly expressing all three elements, and the cell count for the inactive category is n - n_triple.
We calculate cluster-specific p-values if users supply cluster labels. This is useful if we are interested in cluster-specific networks. The pruned regulons can then be used to visualize differential networks for transcription factors of interest. See section on differential networks.
pruned.regulon <- pruneRegulon(expMatrix = GeneExpressionMatrix,
exp_assay = "normalizedCounts",
peakMatrix = PeakMatrix,
peak_assay = "counts",
regulon = regulon,
prune_value = "pval",
regulon_cutoff = 0.05,
clusters = GeneExpressionMatrix$Clusters2)
## pruning network with chi.sq tests using a regulon cutoff of pval<0.05
## pruning regulons
4.7 Add Weights
While the pruneRegulon
function provides statistics on the joint occurrence of TF-RE-TG, we would like to further estimate the strength of regulation. Biologically, this can be interpreted as the magnitude of gene expression changes induced by transcription factor activity. Epiregulon estimates the regulatory potential using one of the three measures: 1) correlation between TG and TF or between TG and the product of TF and RE, 2) mutual information between TG and TF expression or between TG and the product of TF and RE, or 3) Wilcoxon test statistics of target gene expression in cells jointly expressing all 3 elements vs cells that do not.
Two measures (correlation and Wilcoxon) give both the magnitude and directionality of changes whereas mutational information is always positive. The correlation and mutual information statistics are computed on grouped pseudobulks by user-supplied cluster labels and yield a single weight across all clusters per each TF-RE-target triplet. In contrast, the Wilcoxon method group cells based on the joint expression of TF, RE and TG in each single cell or in cell aggregates. If cell labels are provided, we calculate cluster-specific weights in addition to estimating weights from all the cells. Cell aggregation uses a default value of 10 cells and can help overcome sparsity and speed up computation. If cluster labels are provided, we can obtain weights of individual clusters and all cells combined.
In this example, we apply Wilcoxon test on cell aggregates of 10 cells. We use the Wilcoxon weight method because we are interested in computing cell type-specific weights.
set.seed(1010)
regulon.w <- addWeights(regulon = pruned.regulon,
expMatrix = GeneExpressionMatrix,
exp_assay = "normalizedCounts",
peakMatrix = PeakMatrix,
peak_assay = "counts",
clusters = GeneExpressionMatrix$Clusters2,
aggregateCells = TRUE,
method = "wilcox",
useDim = "IterativeLSI")
## adding weights using wilcoxon...
## performing pseudobulk using an average of 10 cells
## DataFrame with 417041 rows and 14 columns
## idxATAC chr start end idxRNA target
## <integer> <character> <integer> <integer> <integer> <character>
## 1 732 chr1 8021367 8021867 95 UTS2
## 2 891 chr1 9223922 9224422 107 H6PD
## 3 1154 chr1 11724524 11725024 135 FBXO6
## 4 1476 chr1 16003338 16003838 181 DDI2
## 5 2329 chr1 25237599 25238099 295 RUNX3
## ... ... ... ... ... ... ...
## 417037 134071 chr19 41768840 41769340 9199 TGFB1
## 417038 135998 chr19 54711800 54712300 9580 MBOAT7
## 417039 144131 chr22 30838237 30838737 11838 SF3A1
## 417040 144587 chr22 37297198 37297698 11901 CSF2RB
## 417041 144587 chr22 37297198 37297698 11904 MPST
## distance idxTF tf corr pval
## <integer> <integer> <character> <matrix> <matrix>
## 1 48072 1030 ADNP 0.534911 1.11552e-06:1:0.00021207:...
## 2 70440 1030 ADNP 0.580146 2.42572e-53:1:1.00000000:...
## 3 373 1030 ADNP 0.504093 1.70871e-07:1:0.96772981:...
## 4 59384 1030 ADNP 0.701968 2.50344e-04:1:1.00000000:...
## 5 53512 1030 ADNP 0.859918 7.39278e-14:1:1.00000000:...
## ... ... ... ... ... ...
## 417037 90490 473 ZSCAN29 0.621256 7.23903e-01:1:1:...
## 417038 18066 473 ZSCAN29 0.797633 4.26510e-02:1:1:...
## 417039 85323 473 ZSCAN29 0.541202 2.76179e-05:1:1:...
## 417040 11976 473 ZSCAN29 0.676467 1.12773e-03:1:1:...
## 417041 117984 473 ZSCAN29 0.551815 1.81933e-02:1:1:...
## stats qval
## <matrix> <matrix>
## 1 23.7177:0:13.72100224:... 1.00000e+00:1:1:...
## 2 236.3764:0: 0.00000000:... 4.49763e-47:1:1:...
## 3 27.3374:0: 0.00163667:... 2.89349e-01:1:1:...
## 4 13.4096:0: 0.00000000:... 1.00000e+00:1:1:...
## 5 55.9609:0: 0.00000000:... 1.29743e-07:1:1:...
## ... ... ...
## 417037 0.124784:0:0:... 1:1:1:...
## 417038 4.109182:0:0:... 1:1:1:...
## 417039 17.575117:0:0:... 1:1:1:...
## 417040 10.605155:0:0:... 1:1:1:...
## 417041 5.577442:0:0:... 1:1:1:...
## weight
## <matrix>
## 1 0.110121:-0.1840694:0.132197:...
## 2 0.328729: 0.0954775:0.218160:...
## 3 0.142351: 0.1861417:0.180140:...
## 4 0.175202: 0.0000000:0.000000:...
## 5 0.455341: 0.0000000:0.302052:...
## ... ...
## 417037 0.124462: 0.182873:0:...
## 417038 0.226807:-0.109380:0:...
## 417039 0.256050:-0.198350:0:...
## 417040 0.250945: 0.000000:0:...
## 417041 0.275493: 0.000000:0:...
4.8 Calculate TF activity
Finally, the activities for a specific TF in each cell are computed by averaging the weighted expressions of target genes linked to the TF weighted. \[y=\frac{1}{n}\sum_{i=1}^{n} x_i * weight_i\] where \(y\) is the activity of a TF for a cell \(n\) is the total number of targets for a TF \(x_i\) is the log count expression of target i where i in {1,2,…,n} \(weight_i\) is the weight of TF and target i
score.combine <- calculateActivity(expMatrix = GeneExpressionMatrix,
regulon = regulon.w,
mode = "weight",
method = "weightedMean",
exp_assay = "normalizedCounts")
## calculating TF activity from regulon using weightedmean
## Warning in calculateActivity(expMatrix = GeneExpressionMatrix, regulon =
## regulon.w, : The weight column contains multiple subcolumns but no cluster
## information was provided. Using first column to compute activity...
## aggregating regulons...
## creating weight matrix...
## calculating activity scores...
## normalize by the number of targets...
## 5 x 5 sparse Matrix of class "dgCMatrix"
## scATAC_BMMC_R1#TTATGTCAGTGATTAG-1 scATAC_BMMC_R1#AAGATAGTCACCGCGA-1
## ADNP 0.15753057 0.2101894
## AFF1 0.19542798 0.1939089
## AFF4 0.11994675 0.3389850
## AGO1 0.18243109 0.2156398
## AGO2 0.08236242 0.3744209
## scATAC_BMMC_R1#GCATTGAAGATTCCGT-1 scATAC_BMMC_R1#TATGTTCAGGGTTCCC-1
## ADNP 0.1741788 0.1582809
## AFF1 0.5338244 0.1934617
## AFF4 0.1341601 0.1989533
## AGO1 0.1189042 0.1736376
## AGO2 0.1113447 0.2071216
## scATAC_BMMC_R1#AGTTACGAGAACGTCG-1
## ADNP 0.18664284
## AFF1 0.20511278
## AFF4 0.13569139
## AGO1 0.21700829
## AGO2 0.07640601
4.9 Differential TF activity test
We can next determine which TFs exhibit differential activities across cell clusters/groups via the findDifferentialActivity
function. This function depends on findMarkers
function from scran package.
library(epiregulon.extra)
markers <- findDifferentialActivity(activity_matrix = score.combine,
clusters = GeneExpressionMatrix$Clusters2,
pval.type = "some",
direction = "up",
logvalues = FALSE)
markers
## List of length 11
## names(11): B CD4.M CD4.N CLP Erythroid GMP Mono NK pDC PreB Progenitor
getSigGenes compiles the different test results into a single dataframe and enables user to supply their desired cutoffs for significance and variable to order by.
## Using a cutoff of 0.21 for class B for direction equal to any
## Using a cutoff of 0.43 for class CD4.M for direction equal to any
## Using a cutoff of 0.42 for class CD4.N for direction equal to any
## Using a cutoff of 0.22 for class CLP for direction equal to any
## Using a cutoff of 0.3 for class Erythroid for direction equal to any
## Using a cutoff of 0.22 for class GMP for direction equal to any
## Using a cutoff of 0.8 for class Mono for direction equal to any
## Using a cutoff of 0.44 for class NK for direction equal to any
## Using a cutoff of 0.14 for class pDC for direction equal to any
## Using a cutoff of 0.25 for class PreB for direction equal to any
## Using a cutoff of 0.19 for class Progenitor for direction equal to any
## p.value FDR summary.diff class tf
## 8 0.000000e+00 0.000000e+00 0.5742061 B NFKB2
## 5 0.000000e+00 0.000000e+00 0.4691837 B RELB
## 9 0.000000e+00 0.000000e+00 0.4664992 B NFATC1
## 13 0.000000e+00 0.000000e+00 0.8955789 CD4.M TSHZ1
## 20 0.000000e+00 0.000000e+00 0.8746484 CD4.M MAF
## 16 0.000000e+00 0.000000e+00 0.7291512 CD4.M GPN1
## 7 0.000000e+00 0.000000e+00 0.7749790 CD4.N NCOA3
## 51 0.000000e+00 0.000000e+00 0.6316421 CD4.N HEXIM1
## 10 0.000000e+00 0.000000e+00 0.6265538 CD4.N TSHZ1
## 2 0.000000e+00 0.000000e+00 0.3386415 CLP ZNF189
## 1 0.000000e+00 0.000000e+00 0.2383423 CLP ERG
## 3 1.363478e-195 3.757746e-194 0.2776740 CLP ATF7
## 11 0.000000e+00 0.000000e+00 0.5034554 Erythroid DACH1
## 52 0.000000e+00 0.000000e+00 0.4664371 Erythroid DTL
## 71 0.000000e+00 0.000000e+00 0.3999763 Erythroid LDB1
## 131 0.000000e+00 0.000000e+00 0.4562134 GMP TBP
## 101 0.000000e+00 0.000000e+00 0.3935184 GMP HOXA9
## 31 0.000000e+00 0.000000e+00 0.3344333 GMP ZNF189
## 34 0.000000e+00 0.000000e+00 4.9076907 Mono NCOA3
## 17 0.000000e+00 0.000000e+00 4.6506703 Mono MATR3
## 21 0.000000e+00 0.000000e+00 2.4419443 Mono RARA
## 132 0.000000e+00 0.000000e+00 1.0005003 NK TSHZ1
## 14 0.000000e+00 0.000000e+00 0.9053762 NK MAF
## 171 0.000000e+00 0.000000e+00 0.7572377 NK FOXN3
## 32 0.000000e+00 0.000000e+00 0.1969361 pDC CBFA2T3
## 22 0.000000e+00 0.000000e+00 0.1877417 pDC ZNF189
## 15 0.000000e+00 0.000000e+00 0.1413879 pDC NFIA
## 18 4.382395e-226 7.548675e-224 0.9886831 PreB POU2AF1
## 23 7.989647e-216 1.100973e-213 0.3167865 PreB BACH2
## 33 2.733268e-209 2.690317e-207 0.3248628 PreB RAG1
## 12 0.000000e+00 0.000000e+00 0.8025848 Progenitor HOXA9
## 19 0.000000e+00 0.000000e+00 0.4608621 Progenitor DACH1
## 24 0.000000e+00 0.000000e+00 0.4265633 Progenitor MEIS1
4.10 Visualizing TF activities
Epiregulon also provides the option for visualizing the inferred TF activities in reduced dimensions.
options(ggrastr.default.dpi=300)
tfs_interest <- c("EBF1","PAX5", "GATA3","SPI1")
plotActivityDim(sce = GeneExpressionMatrix,
activity_matrix = score.combine[tfs_interest,],
tf = tfs_interest,
dimtype = "UMAP",
nrow=2,
ncol=2,
point_size=0.1,
rasterise = TRUE)
We can compare the activity with gene expression of the same TFs.
plotActivityDim(sce = GeneExpressionMatrix,
activity_matrix = assay(GeneExpressionMatrix, "normalizedCounts")[tfs_interest,],
tf = tfs_interest,
dimtype = "UMAP",
nrow=2,
ncol=2,
legend.label = "Gex",
colors = c("grey","blue"),
point_size=0.1,
rasterise = TRUE)
We can also plot violin plot to visualize TF activity.
plotActivityViolin(activity_matrix = score.combine,
tf = tfs_interest,
clusters = GeneExpressionMatrix$Clusters2,
legend.label = "Gex",
nrow=2,
ncol=2)
We plot violin plot to visualize TF gene expression.
plotActivityViolin(activity_matrix = assay(GeneExpressionMatrix, "normalizedCounts")[tfs_interest,],
tf = tfs_interest,
clusters = GeneExpressionMatrix$Clusters2,
nrow=2,
ncol=2,
legend.label = "gene expression")
We can visualize the different TFs in a bubble plot:
plotBubble(activity_matrix = score.combine,
tf = tfs_interest,
GeneExpressionMatrix$Clusters2,
pval.type = "some",
direction = "up",
logvalues = FALSE)
We visualize the top differential TFs based on activity.
plotBubble(activity_matrix = score.combine,
tf = markers.sig$tf,
GeneExpressionMatrix$Clusters2,
pval.type = "some",
direction = "up",
logvalues = FALSE)
4.11 Geneset enrichment
Sometimes we are interested to know what pathways are enriched in the regulon of a particular TF. We can perform geneset enrichment using the enricher function from clusterProfiler.
Here we first download Hallmark and C2 signatures from hallmark and then perform gene set enrichment of the known lineage factors. As expected, EBF1 is consistent with a B cell lineage factor, GATA3 and RUNX3 with lymphoid lineage and SPI1 with myeloid lineage.
#retrieve genesets
H <- EnrichmentBrowser::getGenesets(org = "hsa",
db = "msigdb",
cat = "H",
gene.id.type = "SYMBOL" )
C2 <- EnrichmentBrowser::getGenesets(org = "hsa",
db = "msigdb",
cat = "C2",
gene.id.type = "SYMBOL" )
#combine genesets and convert genesets to be compatible with enricher
gs <- c(H, C2)
gs.list <- do.call(rbind,lapply(names(gs), function(x)
{data.frame(gs=x, genes=gs[[x]])}))
enrichresults <- regulonEnrich(TF = tfs_interest,
regulon = regulon.w,
weight = "weight",
weight_cutoff = 0,
genesets = gs.list)
## EBF1
##
## PAX5
## GATA3
## SPI1
4.12 Differential Network analysis
In addition to looking at the summed TF activity, a second approach to investigate differential TF activity is to compare and contrast target genes or network topology. In this example, we know that EBF1 is a B cell lineage factor. If we plot the differential network of EBF1 using the regulon with cluster-specific weights, we can see that EBF1 has many more targets in PreB cells than it has in CD4 memory T cells.
plotDiffNetwork(regulon.w,
cutoff = 0,
tf = c("EBF1"),
weight = "weight",
clusters = c("PreB","CD4.M"),
layout = "stress")
## Replacement of na values for weights with 0
## Building graph using weight as edge weights
Sometimes, we are interested to identify interaction partners of the TFs of interest. This can be achieved by comparing the overlap of the targets genes for all the TFs and identify the most similar TFs by Jaccard similarity. To illustrate this function, we take a look at the top most similar 20 TFs to EBF1, and we successfully identify PAX5 as the most similar TF. Both PAX5 and EBF1 are important factors for B cell development (https://www.nature.com/articles/ni.2641).
library(ggplot2)
# construct a graph of the preB cells
preB_network <- buildGraph(regulon.w, weights = "weight", cluster="PreB")
## Building graph using weight as edge weights
# compute a similarity matrix of all TFs
similarity_score <- calculateJaccardSimilarity(preB_network)
# Focus on EBF1
similarity_score_EBF1 <- similarity_score[, "EBF1"]
similarity_df <- data.frame(similarity = head(sort(similarity_score_EBF1,
decreasing = TRUE),20),
TF = names(head(sort(similarity_score_EBF1,
decreasing = TRUE),20)))
similarity_df$TF <- factor(similarity_df$TF, levels = rev(unique(similarity_df$TF)))
# plot top TFs most similar to EBF1
topTFplot <- ggplot(similarity_df, aes(x=TF, y=similarity)) +
geom_bar(stat="identity") +
coord_flip() +
ggtitle("EBF1 similarity") +
theme_classic()
print(topTFplot)
In order to convince ourselves that our differential network is statistically significant, we permute the edges and obtain a background graph from averaging many iterations. Here, we plot the differential network graph subtracted by permuted graphs.
# create a permuted graph by rewiring the edges 100 times
permute_matrix <- permuteGraph(preB_network, "EBF1", 100, p=1)
permute_matrix <- permute_matrix[names(similarity_score_EBF1),]
diff_matrix <- similarity_score_EBF1-rowMeans(permute_matrix)
diff_matrix_df <- data.frame(similarity = head(sort(diff_matrix,
decreasing = TRUE),20),
TF = names(head(sort(diff_matrix,
decreasing = TRUE),20)))
diff_matrix_df$TF <- factor(diff_matrix_df$TF, levels = rev(unique(diff_matrix_df$TF)))
# plot top TFs most similar to EBF1
topTFplot <- ggplot(diff_matrix_df, aes(x=TF, y=similarity)) +
geom_bar(stat="identity") +
coord_flip() +
ggtitle("background subtracted EBF1 similarity ") +
theme_classic()
print(topTFplot)
# obtain empirical p-values
p_matrix <- rowMeans(apply(permute_matrix, 2, function(x) {x > similarity_score_EBF1}))
p_matrix[names(head(sort(diff_matrix,decreasing = TRUE),20))]
## PAX5 IRF4 TCF3 RAG2 TCF12 POU2F2 NIPBL RB1 TCF4 RAG1 TRIM22
## 0 0 0 0 0 0 0 0 0 0 0
## MYB CREB1 EZH2 KDM2B FOXP1 CREM EP300 CREBBP FOXO1
## 0 0 0 0 0 0 0 0 0
Next, we are interested to compare the networks of two cell types, in this case, CD4 memory T cells (CD4.M) vs Monocytes (mono) cells. We build an edge subtracted graph and then calculate the degree centrality of the subtracted graph. We normalize centrality using the default square root function. The top 5 most positive TFs represent lineage factors more active in NK cells whereas the bottom 5 TFs present lineage factors enriched in CD4. We successfully identified the myeloid factor SPI1 to be associated with monocytes and Th1 factor TBX21 to be associated with CD4 T cells.
#regulon.w.2 <- regulon.w
#regulon.w <- readRDS("/gstore/project/lineage/manuscript/epiregulon/OUTPUT/regulon.w.rds")
# construct a graph of the CD4.M and NK cells respectively
CD4.M_network <- buildGraph(regulon.w, weights = "weight", cluster="CD4.M")
## Building graph using weight as edge weights
## Building graph using weight as edge weights
# construct a difference graph
diff_graph <- buildDiffGraph(Mono_network,CD4.M_network, abs_diff = FALSE)
diff_graph <- addCentrality(diff_graph)
diff_graph <- normalizeCentrality(diff_graph)
rank_table <- rankTfs(diff_graph)
library(ggplot2)
ggplot(rank_table, aes(x = rank, y = centrality)) +
geom_point() +
ggrepel::geom_text_repel(data = rbind(head(rank_table, 10),
tail(rank_table, 10)),
aes(label = tf),
nudge_x = 0, nudge_y = 0, box.padding = 0.5, max.overlaps = Inf) +
theme_classic() + ggtitle ("differential TFs (Mono-CD4.M) ranked by degree centrality")
We can further explore interacting factors with the myeloid factor SPI1 using the same Jaccard similarity approach. We found CEBPA as the most similar TF as SPI1. SPI1 and CEBPA are known to be important for differentiation into myeloid cells (https://www.cell.com/cell-reports/pdfExtended/S2211-1247(18)30745-9).
diff_graph_filter <- subgraph.edges(diff_graph,
E(diff_graph)[E(diff_graph)$weight>0],
del=TRUE)
# compute a similarity matrix of all TFs
similarity_score <- calculateJaccardSimilarity(diff_graph_filter)
# Focus on SPI1
similarity_score_SPI1 <- similarity_score[, "SPI1"]
similarity_df <- data.frame(similarity = head(sort(similarity_score_SPI1,
decreasing = TRUE),20),
TF = names(head(sort(similarity_score_SPI1,
decreasing = TRUE),20)))
similarity_df$TF <- factor(similarity_df$TF,
levels = rev(unique(similarity_df$TF)))
# plot top TFs most similar to SPI1
topTFplot <- ggplot(similarity_df, aes(x=TF, y=similarity)) +
geom_bar(stat="identity") +
coord_flip() +
ggtitle("SPI1 similarity") +
theme_classic()
print(topTFplot)
4.13 Session Info
## R version 4.4.0 (2024-04-24)
## Platform: x86_64-pc-linux-gnu
## Running under: Red Hat Enterprise Linux 8.6 (Ootpa)
##
## Matrix products: default
## BLAS/LAPACK: /apps/rocs/2020.08/cascadelake/software/OpenBLAS/0.3.9-GCC-9.3.0/lib/libopenblas_skylakexp-r0.3.9.so; LAPACK version 3.9.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/Los_Angeles
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] igraph_2.0.3 ggplot2_3.5.1
## [3] org.Hs.eg.db_3.20.0 AnnotationDbi_1.67.0
## [5] msigdbr_7.5.1 epiregulon.extra_1.3.1
## [7] epiregulon_1.3.4 scMultiome_1.5.7
## [9] SingleCellExperiment_1.27.2 MultiAssayExperiment_1.31.5
## [11] SummarizedExperiment_1.35.4 Biobase_2.65.1
## [13] GenomicRanges_1.57.2 GenomeInfoDb_1.41.2
## [15] IRanges_2.39.2 S4Vectors_0.43.2
## [17] MatrixGenerics_1.17.0 matrixStats_1.4.1
## [19] ExperimentHub_2.13.1 AnnotationHub_3.13.3
## [21] BiocFileCache_2.13.2 dbplyr_2.5.0
## [23] BiocGenerics_0.51.3
##
## loaded via a namespace (and not attached):
## [1] splines_4.4.0 later_1.3.2 ggplotify_0.1.2
## [4] bitops_1.0-9 filelock_1.0.3 R.oo_1.26.0
## [7] tibble_3.2.1 polyclip_1.10-7 graph_1.83.0
## [10] XML_3.99-0.17 lifecycle_1.0.4 edgeR_4.3.19
## [13] processx_3.8.4 lattice_0.22-6 MASS_7.3-61
## [16] backports_1.5.0 magrittr_2.0.3 limma_3.61.12
## [19] sass_0.4.9 rmarkdown_2.28 jquerylib_0.1.4
## [22] yaml_2.3.10 metapod_1.13.0 RColorBrewer_1.1-3
## [25] cowplot_1.1.3 chromote_0.3.1 DBI_1.2.3
## [28] abind_1.4-8 zlibbioc_1.51.1 R.utils_2.12.3
## [31] purrr_1.0.2 ggraph_2.2.1 RCurl_1.98-1.16
## [34] yulab.utils_0.1.7 tweenr_2.0.3 rappdirs_0.3.3
## [37] GenomeInfoDbData_1.2.13 enrichplot_1.25.3 ggrepel_0.9.6
## [40] irlba_2.3.5.1 tidytree_0.4.6 BiocStyle_2.33.1
## [43] annotate_1.83.0 dqrng_0.4.1 codetools_0.2-20
## [46] DelayedArray_0.31.14 DOSE_3.99.1 scuttle_1.15.4
## [49] ggforce_0.4.2 tidyselect_1.2.1 aplot_0.2.3
## [52] UCSC.utils_1.1.0 farver_2.1.2 ScaledMatrix_1.13.0
## [55] viridis_0.6.5 jsonlite_1.8.9 BiocNeighbors_1.99.2
## [58] tidygraph_1.3.1 scater_1.33.4 tools_4.4.0
## [61] treeio_1.29.1 Rcpp_1.0.13 glue_1.8.0
## [64] gridExtra_2.3 SparseArray_1.5.44 xfun_0.48
## [67] qvalue_2.37.0 websocket_1.4.2 dplyr_1.1.4
## [70] HDF5Array_1.33.8 withr_3.0.1 BiocManager_1.30.25
## [73] fastmap_1.2.0 rhdf5filters_1.17.0 bluster_1.15.1
## [76] fansi_1.0.6 digest_0.6.37 rsvd_1.0.5
## [79] gridGraphics_0.5-1 R6_2.5.1 mime_0.12
## [82] colorspace_2.1-1 GO.db_3.20.0 Cairo_1.6-2
## [85] RSQLite_2.3.7 R.methodsS3_1.8.2 utf8_1.2.4
## [88] tidyr_1.3.1 generics_0.1.3 data.table_1.16.2
## [91] graphlayouts_1.2.0 httr_1.4.7 S4Arrays_1.5.10
## [94] scatterpie_0.2.4 pkgconfig_2.0.3 gtable_0.3.5
## [97] blob_1.2.4 XVector_0.45.0 shadowtext_0.1.4
## [100] clusterProfiler_4.13.4 htmltools_0.5.8.1 fgsea_1.31.6
## [103] bookdown_0.40 GSEABase_1.67.0 scales_1.3.0
## [106] png_0.1-8 ggfun_0.1.7 scran_1.33.2
## [109] knitr_1.48 rstudioapi_0.16.0 reshape2_1.4.4
## [112] nlme_3.1-166 checkmate_2.3.2 curl_5.2.3
## [115] cachem_1.1.0 rhdf5_2.49.0 stringr_1.5.1
## [118] BiocVersion_3.20.0 parallel_4.4.0 vipor_0.4.7
## [121] ggrastr_1.0.2 pillar_1.9.0 grid_4.4.0
## [124] vctrs_0.6.5 promises_1.3.0 BiocSingular_1.21.4
## [127] beachmat_2.21.6 xtable_1.8-4 cluster_2.1.6
## [130] beeswarm_0.4.0 Rgraphviz_2.49.1 evaluate_1.0.1
## [133] KEGGgraph_1.65.0 cli_3.6.3 locfit_1.5-9.10
## [136] compiler_4.4.0 rlang_1.1.4 crayon_1.5.3
## [139] labeling_0.4.3 ps_1.8.0 fs_1.6.4
## [142] plyr_1.8.9 ggbeeswarm_0.7.2 stringi_1.8.4
## [145] viridisLite_0.4.2 BiocParallel_1.39.0 babelgene_22.9
## [148] munsell_0.5.1 Biostrings_2.73.2 lazyeval_0.2.2
## [151] GOSemSim_2.31.2 Matrix_1.7-0 patchwork_1.3.0
## [154] bit64_4.5.2 Rhdf5lib_1.27.0 KEGGREST_1.45.1
## [157] statmod_1.5.0 highr_0.11 beachmat.hdf5_1.3.3
## [160] memoise_2.0.1 bslib_0.8.0 ggtree_3.13.1
## [163] fastmatch_1.1-4 bit_4.5.0 EnrichmentBrowser_2.35.1
## [166] gson_0.1.0 ape_5.8