5 Refinement of gene regulatory network

The regulon object constructed in the previous chapter contains all possible links between transcription factors and target genes. As a result, the GRN can be large and may contain some false connections. To refine it, we need to apply filters that retain only those connections supported by our dataset. One such filter has already been applied in the calculateP2G function, which not only searches for regulatory elements within a specified distance from each target gene, but also computes the empirical p-value of the correlation for each target gene–regulatory element pair. Pairs are retained if their p-value is below the threshold specified by the cutoff_sig argument.

In this chapter, we introduce three additional filters. The pruneRegulon function filters the GRN based on the co-occurrence of TF gene expression, RE chromatin accessibility and target gene expression. The addMotifScore function annotates regulons for the presence of motifs. Lastly, the addLogFC function calculates changes in gene expression between conditions, so that only target genes showing differential expression are selected.

5.1 Prune network

Epiregulon prunes the network by performing tests of independence on the observed number of cells jointly expressing transcription factor (\(TF\)), regulatory element (\(RE\)) and target gene (\(TG\)). The thresholds at which the features are considered expressed are defined by the peak_cutoff and exp_cutoff arguments. We implement two tests, the binomial test and the chi-square test. In the binomial test, the expected probability is \(P(TF, RE) \cdot P(TG)\), and the number of trials is the total number of cells (or metacells if we set aggregateCells=TRUE), and the observed successes is the number of cells jointly expressing all three elements. In the chi-square test, the expected probability for having all 3 elements active is also \(P(TF, RE) \cdot P(TG)\). The observed cell count for the active category is the number of cells jointly expressing all three elements (\(n_{triple}\)).

pruned.regulon <- pruneRegulon(expMatrix = GeneExpressionMatrix,
                               exp_assay = "normalizedCounts",
                               peakMatrix = PeakMatrix,
                               peak_assay = "counts",
                               regulon = regulon,
                               prune_value = "pval",
                               regulon_cutoff = 0.05)

pruned.regulon[,c("idxATAC", "target", "tf", "corr", "pval", "stats")]

## DataFrame with 2575700 rows and 6 columns
##           idxATAC     target          tf      corr        pval    stats
##         <integer>    <array> <character>  <matrix>    <matrix> <matrix>
## 1             135   TNFRSF18        ADNP  0.796526 1.15983e-04 14.85689
## 2             266       NADK        ADNP  0.789720 2.14101e-07 26.90142
## 3             491 AL139246.3        ADNP -0.528735 1.98772e-02  5.42265
## 4             991        CA6        ADNP  0.863810 2.09927e-03  9.46065
## 5            1103   SLC25A33        ADNP -0.834313 1.01188e-03 10.80569
## ...           ...        ...         ...       ...         ...      ...
## 2575696    155577       SCO2        ZXDC -0.548795 3.06126e-02  4.67456
## 2575697    155577       TYMP        ZXDC -0.560573 4.48926e-04 12.31669
## 2575698    155577      ODF3B        ZXDC -0.549052 6.00511e-03  7.54877
## 2575699    157674        OGT        ZXDC  0.694312 1.01153e-02  6.61447
## 2575700    158053       RADX        ZXDC  0.727684 3.72567e-06 21.40105

5.1.1 Aggregate cells

The pruneRegulon function also supports aggregation of cells into metacells. This can substantially speed up GRN refinement if the dataset is large (>100K cells). Ideally, cell aggregation should also reduce noise while still maintaining the underlying biological signals. Since the data is less sparse after aggregation, users may want to adjust the feature cutoffs to better differentiate between cell aggregates with low and high expression. Setting exp_cutoff and peak_cutoff to NULL allows pruneRegulon to calculate the mean values for each feature and use these as the thresholds.

pruned.regulon.aggr <- pruneRegulon(expMatrix = GeneExpressionMatrix,
                               exp_assay = "normalizedCounts",
                               peakMatrix = PeakMatrix,
                               peak_assay = "counts",
                               regulon = regulon,
                               prune_value = "pval",
                               regulon_cutoff = 0.05,
                               aggregateCells = TRUE,
                               useDim = "LSI_RNA",
                               cellNum = 10,
                               exp_cutoff = NULL,
                               peak_cutoff = NULL)

5.2 Annotate with TF motifs

So far, the gene regulatory network has been constructed exclusively from TF ChIP-seq data. Some users may want to further annotate regulatory elements with motifs. We provide an option to annotate peaks with motifs from the CIS_BP database. If no motifs are present for a particular factor (e.g., for co-factors or chromatin modifiers), NA is returned. If motifs exist for a factor and the regulatory element contains a motif, 1 is returned; if the regulatory element does not contain a motif, 0 is returned. Users can also provide their own motif annotation through the pwms argument.

It is important to note that filtering for the presence of motifs removes a large fraction of the target genes. Motifs are often present in a small subset of the ChIP-seq peaks (can be as low as 10%). Second, indirect TF binding, possibly through its interaction partners, may have a true biological function.

pruned.regulon.motif <- addMotifScore(regulon = pruned.regulon,
                                      peaks = rowRanges(PeakMatrix),
                                      species = "human",
                                      genome = "hg38")

# filter out the rows where TF motif is not found in the peak region
pruned.regulon.motif <- pruned.regulon.motif[which(pruned.regulon.motif$motif == 1),]

pruned.regulon.motif[,c("idxATAC", "target", "tf", "corr", "p_val_peak_gene", "pval", "motif")]

## DataFrame with 192760 rows and 7 columns
##          idxATAC   target          tf      corr p_val_peak_gene        pval     motif
##        <integer>  <array> <character>  <matrix>        <matrix>    <matrix> <numeric>
## 1            199    ACAP3         AHR -0.628315      0.01717839 2.44093e-02         1
## 2            723 TNFRSF25         AHR  0.771313      0.02373300 3.79686e-08         1
## 3           1100 SLC25A33         AHR -0.726697      0.00437233 2.41562e-08         1
## 4           3937   RNF19B         AHR -0.715055      0.00534396 1.14941e-02         1
## 5           4494     PPT1         AHR  0.718666      0.03720643 2.44180e-05         1
## ...          ...      ...         ...       ...             ...         ...       ...
## 192756    157055      CFP     ZSCAN16  0.829179       0.0130202 0.009152565         1
## 192757     33086   GTPBP8      ZSCAN4  0.773445       0.0230119 0.000171032         1
## 192758     40337    PYURF      ZSCAN4  0.701525       0.0431397 0.019287955         1
## 192759    104541    ZNF84      ZSCAN4  0.745546       0.0295838 0.034884124         1
## 192760    115841   PARP16      ZSCAN4  0.687043       0.0487639 0.003539738         1

5.3 Annotate with log fold changes

It is sometimes helpful to filter the regulons based on gene expression changes between two conditions. The addLogFC function is a wrapper around scran::findMarkers and adds extra columns of log changes to the regulon DataFrame. Users can specify the reference group in logFC_ref and comparison groups in logFC_condition. If these are not provided, log fold changes are calculated for every condition in the cluster labels against the rest of the conditions.

# create logcounts
GeneExpressionMatrix <- scuttle::logNormCounts(GeneExpressionMatrix)

# add log fold changes which are calculated by taking the difference of the log counts
pruned.regulon <- addLogFC(regulon = pruned.regulon,
                           clusters = GeneExpressionMatrix$cell_type,
                           expMatrix  = GeneExpressionMatrix,
                           assay.type  = "logcounts",
                           pval.type = "any",
                           logFC_condition = unique(GeneExpressionMatrix$cell_type))

pruned.regulon[,c("idxATAC", "target", "tf", "Memory CD8+ T.vs.rest.logFC", "B.vs.rest.FDR")]

## DataFrame with 2575700 rows and 5 columns
##           idxATAC     target          tf Memory CD8+ T.vs.rest.logFC B.vs.rest.FDR
##         <integer>    <array> <character>                   <numeric>     <numeric>
## 1             135   TNFRSF18        ADNP                  -0.0906689   2.12607e-23
## 2             266       NADK        ADNP                  -0.3582261   1.86039e-71
## 3             491 AL139246.3        ADNP                  -0.0156938   2.86752e-02
## 4             991        CA6        ADNP                  -0.3802831   2.38300e-91
## 5            1103   SLC25A33        ADNP                  -0.2311624   5.77745e-26
## ...           ...        ...         ...                         ...           ...
## 2575696    155577       SCO2        ZXDC                  -0.6694588  1.26531e-315
## 2575697    155577       TYMP        ZXDC                  -2.3157151   0.00000e+00
## 2575698    155577      ODF3B        ZXDC                  -0.3380033  2.16867e-129
## 2575699    157674        OGT        ZXDC                   0.5698316   3.41241e-56
## 2575700    158053       RADX        ZXDC                   0.0456558   1.51853e-04

TFs do not necessarily alter the expression of the genes next to where they bind, and thus filtering for target genes that do show differential expression can help refine the regulons. Since we do not have a defined experimental setup in this dataset, we retain target genes that show differential expression in any cell types vs. the rest.

FDR <- pruned.regulon[,grep("rest.FDR", colnames(pruned.regulon ))]
FDR <- as.matrix(FDR) < 0.05
FDR <- apply(FDR,2, as.numeric)

logFC <- pruned.regulon[,grep("rest.logFC", colnames(pruned.regulon ))]
logFC <- abs(as.matrix(logFC)) > 0.3
logFC <- apply(logFC,2, as.numeric)

logFC_FDR <- FDR*logFC

pruned.regulon.filtered <- pruned.regulon[which(rowSums(logFC_FDR) > 0), ]