5 Refinement of gene regulatory network
The regulon object constructed in the previous chapter contains all possible links between transcription factors and target genes. As a result, the GRN can be large and may contain some false connections. To refine it, we need to apply filters that retain only those connections supported by our dataset. One such filter has already been applied in the calculateP2G function, which not only searches for regulatory elements within a specified distance from each target gene, but also computes the empirical p-value of the correlation for each target gene–regulatory element pair. Pairs are retained if their p-value is below the threshold specified by the cutoff_sig argument.
In this chapter, we introduce three additional filters. The pruneRegulon function filters the GRN based on the co-occurrence of TF gene expression, RE chromatin accessibility and target gene expression. The addMotifScore function annotates regulons for the presence of motifs. Lastly, the addLogFC function calculates changes in gene expression between conditions, so that only target genes showing differential expression are selected.
5.1 Prune network
Epiregulon prunes the network by performing tests of independence on the observed number of cells jointly expressing transcription factor (\(TF\)), regulatory element (\(RE\)) and target gene (\(TG\)). The thresholds at which the features are considered expressed are defined by the peak_cutoff and exp_cutoff arguments. We implement two tests, the binomial test and the chi-square test. In the binomial test, the expected probability is \(P(TF, RE) \cdot P(TG)\), and the number of trials is the total number of cells (or metacells if we set aggregateCells=TRUE), and the observed successes is the number of cells jointly expressing all three elements. In the chi-square test, the expected probability for having all 3 elements active is also \(P(TF, RE) \cdot P(TG)\). The observed cell count for the active category is the number of cells jointly expressing all three elements (\(n_{triple}\)).
pruned.regulon <- pruneRegulon(expMatrix = GeneExpressionMatrix,
exp_assay = "normalizedCounts",
peakMatrix = PeakMatrix,
peak_assay = "counts",
regulon = regulon,
prune_value = "pval",
regulon_cutoff = 0.05)
pruned.regulon[,c("idxATAC", "target", "tf", "corr", "pval", "stats")]## DataFrame with 2575700 rows and 6 columns
## idxATAC target tf corr pval stats
## <integer> <array> <character> <matrix> <matrix> <matrix>
## 1 135 TNFRSF18 ADNP 0.796526 1.15983e-04 14.85689
## 2 266 NADK ADNP 0.789720 2.14101e-07 26.90142
## 3 491 AL139246.3 ADNP -0.528735 1.98772e-02 5.42265
## 4 991 CA6 ADNP 0.863810 2.09927e-03 9.46065
## 5 1103 SLC25A33 ADNP -0.834313 1.01188e-03 10.80569
## ... ... ... ... ... ... ...
## 2575696 155577 SCO2 ZXDC -0.548795 3.06126e-02 4.67456
## 2575697 155577 TYMP ZXDC -0.560573 4.48926e-04 12.31669
## 2575698 155577 ODF3B ZXDC -0.549052 6.00511e-03 7.54877
## 2575699 157674 OGT ZXDC 0.694312 1.01153e-02 6.61447
## 2575700 158053 RADX ZXDC 0.727684 3.72567e-06 21.40105
5.1.1 Aggregate cells
The pruneRegulon function also supports aggregation of cells into metacells. This can substantially speed up GRN refinement if the dataset is large (>100K cells). Ideally, cell aggregation should also reduce noise while still maintaining the underlying biological signals. Since the data is less sparse after aggregation, users may want to adjust the feature cutoffs to better differentiate between cell aggregates with low and high expression. Setting exp_cutoff and peak_cutoff to NULL allows pruneRegulon to calculate the mean values for each feature and use these as the thresholds.
pruned.regulon.aggr <- pruneRegulon(expMatrix = GeneExpressionMatrix,
exp_assay = "normalizedCounts",
peakMatrix = PeakMatrix,
peak_assay = "counts",
regulon = regulon,
prune_value = "pval",
regulon_cutoff = 0.05,
aggregateCells = TRUE,
useDim = "LSI_RNA",
cellNum = 10,
exp_cutoff = NULL,
peak_cutoff = NULL)5.2 Annotate with TF motifs
So far, the gene regulatory network has been constructed exclusively from TF ChIP-seq data. Some users may want to further annotate regulatory elements with motifs. We provide an option to annotate peaks with motifs from the CIS_BP database. If no motifs are present for a particular factor (e.g., for co-factors or chromatin modifiers), NA is returned. If motifs exist for a factor and the regulatory element contains a motif, 1 is returned; if the regulatory element does not contain a motif, 0 is returned. Users can also provide their own motif annotation through the pwms argument.
It is important to note that filtering for the presence of motifs removes a large fraction of the target genes. Motifs are often present in a small subset of the ChIP-seq peaks (can be as low as 10%). Second, indirect TF binding, possibly through its interaction partners, may have a true biological function.
pruned.regulon.motif <- addMotifScore(regulon = pruned.regulon,
peaks = rowRanges(PeakMatrix),
species = "human",
genome = "hg38")
# filter out the rows where TF motif is not found in the peak region
pruned.regulon.motif <- pruned.regulon.motif[which(pruned.regulon.motif$motif == 1),]
pruned.regulon.motif[,c("idxATAC", "target", "tf", "corr", "p_val_peak_gene", "pval", "motif")]## DataFrame with 192760 rows and 7 columns
## idxATAC target tf corr p_val_peak_gene pval motif
## <integer> <array> <character> <matrix> <matrix> <matrix> <numeric>
## 1 199 ACAP3 AHR -0.628315 0.01717839 2.44093e-02 1
## 2 723 TNFRSF25 AHR 0.771313 0.02373300 3.79686e-08 1
## 3 1100 SLC25A33 AHR -0.726697 0.00437233 2.41562e-08 1
## 4 3937 RNF19B AHR -0.715055 0.00534396 1.14941e-02 1
## 5 4494 PPT1 AHR 0.718666 0.03720643 2.44180e-05 1
## ... ... ... ... ... ... ... ...
## 192756 157055 CFP ZSCAN16 0.829179 0.0130202 0.009152565 1
## 192757 33086 GTPBP8 ZSCAN4 0.773445 0.0230119 0.000171032 1
## 192758 40337 PYURF ZSCAN4 0.701525 0.0431397 0.019287955 1
## 192759 104541 ZNF84 ZSCAN4 0.745546 0.0295838 0.034884124 1
## 192760 115841 PARP16 ZSCAN4 0.687043 0.0487639 0.003539738 1
5.3 Annotate with log fold changes
It is sometimes helpful to filter the regulons based on gene expression changes between two conditions. The addLogFC function is a wrapper around scran::findMarkers and adds extra columns of log changes to the regulon DataFrame. Users can specify the reference group in logFC_ref and comparison groups in logFC_condition. If these are not provided, log fold changes are calculated for every condition in the cluster labels against the rest of the conditions.
# create logcounts
GeneExpressionMatrix <- scuttle::logNormCounts(GeneExpressionMatrix)
# add log fold changes which are calculated by taking the difference of the log counts
pruned.regulon <- addLogFC(regulon = pruned.regulon,
clusters = GeneExpressionMatrix$cell_type,
expMatrix = GeneExpressionMatrix,
assay.type = "logcounts",
pval.type = "any",
logFC_condition = unique(GeneExpressionMatrix$cell_type))
pruned.regulon[,c("idxATAC", "target", "tf", "Memory CD8+ T.vs.rest.logFC", "B.vs.rest.FDR")]## DataFrame with 2575700 rows and 5 columns
## idxATAC target tf Memory CD8+ T.vs.rest.logFC B.vs.rest.FDR
## <integer> <array> <character> <numeric> <numeric>
## 1 135 TNFRSF18 ADNP -0.0906689 2.12607e-23
## 2 266 NADK ADNP -0.3582261 1.86039e-71
## 3 491 AL139246.3 ADNP -0.0156938 2.86752e-02
## 4 991 CA6 ADNP -0.3802831 2.38300e-91
## 5 1103 SLC25A33 ADNP -0.2311624 5.77745e-26
## ... ... ... ... ... ...
## 2575696 155577 SCO2 ZXDC -0.6694588 1.26531e-315
## 2575697 155577 TYMP ZXDC -2.3157151 0.00000e+00
## 2575698 155577 ODF3B ZXDC -0.3380033 2.16867e-129
## 2575699 157674 OGT ZXDC 0.5698316 3.41241e-56
## 2575700 158053 RADX ZXDC 0.0456558 1.51853e-04
TFs do not necessarily alter the expression of the genes next to where they bind, and thus filtering for target genes that do show differential expression can help refine the regulons. Since we do not have a defined experimental setup in this dataset, we retain target genes that show differential expression in any cell types vs. the rest.