2 Quick start

In this chapter, we illustrate the main steps of the epiregulon workflow.

2.1 Prepare data

We need the following components to perform GRN analysis:
1. Peak matrix from scATAC-seq
2. Paired gene expression matrix from scRNA-seq
3. Dimensionality reduction matrix

We download an example PBMC dataset from the scMultiome package.

# load the MAE object
library(scMultiome)
library(epiregulon)

# Download the example dataset
mae <- scMultiome::PBMC_10x()

# Load peak matrix
PeakMatrix <- mae[["PeakMatrix"]]

# Load expression matrix
GeneExpressionMatrix <- mae[["GeneExpressionMatrix"]]

Visualize singleCellExperiment by UMAP

scater::plotReducedDim(GeneExpressionMatrix, 
                       dimred = "UMAP_RNA", 
                       text_by = "cell_type", 
                       colour_by = "cell_type",
                       point_size = 0.3,
                       point_alpha = 0.3)

2.2 Retrieve bulk TF ChIP-seq binding sites

We retrieve a GRangesList object containing the binding sites of transcription factors and co-regulators. These binding sites are derived from bulk ChIP-seq data in the ChIP-Atlas and ENCODE databases. For further information on how these peaks are derived, please refer to ?epiregulon::getTFMotifInfo. Currently, only human genomes hg19 and hg38 and mouse mm10 are supported.

grl <- getTFMotifInfo(genome = "hg38")
grl
## GRangesList object of length 1377:
## $AEBP2
## GRanges object with 2700 ranges and 0 metadata columns:
##          seqnames            ranges strand
##             <Rle>         <IRanges>  <Rle>
##      [1]     chr1        9792-10446      *
##      [2]     chr1     942105-942400      *
##      [3]     chr1     984486-984781      *
##      [4]     chr1   3068932-3069282      *
##      [5]     chr1   3069411-3069950      *
##      ...      ...               ...    ...
##   [2696]     chrY   8465261-8465730      *
##   [2697]     chrY 11721744-11722260      *
##   [2698]     chrY 11747448-11747964      *
##   [2699]     chrY 19302661-19303134      *
##   [2700]     chrY 19985662-19985982      *
##   -------
##   seqinfo: 25 sequences from an unspecified genome; no seqlengths
## 
## ...
## <1376 more elements>

2.4 Add TF motif binding to peaks

We next add the TF binding information by overlapping ATAC peaks with ChIP-seq peaks. The output is a data frame object with three columns:

  1. idxATAC - index in the peak matrix
  2. idxTF - index in the gene expression matrix corresponding to the transcription factor
  3. tf - name of the transcription factor
overlap <- addTFMotifInfo(grl = grl, p2g = p2g, peakMatrix = PeakMatrix)
head(overlap)
##   idxATAC idxTF    tf
## 1       1    16  ATF1
## 2       1    17  ATF2
## 3       1    18  ATF3
## 4       1    21  ATF7
## 5       1    35 BRCA2
## 6       1    36  BRD4

2.5 Generate regulons

We then generate a DataFrame object representing the inferred regulons, indicating target genes, transcription factors, and regulatory elements.

regulon <- getRegulon(p2g = p2g, overlap = overlap)
regulon
## DataFrame with 8418612 rows and 12 columns
##           idxATAC         chr     start       end    idxRNA    target      corr p_val_peak_gene FDR_peak_gene  distance     idxTF
##         <integer> <character> <integer> <integer> <integer>   <array>  <matrix>        <matrix>      <matrix> <integer> <integer>
## 1               1        chr1    817086    817586        17 LINC00115  0.487936        0.039787      0.616636      9934        16
## 2               1        chr1    817086    817586        17 LINC00115  0.487936        0.039787      0.616636      9934        17
## 3               1        chr1    817086    817586        17 LINC00115  0.487936        0.039787      0.616636      9934        18
## 4               1        chr1    817086    817586        17 LINC00115  0.487936        0.039787      0.616636      9934        21
## 5               1        chr1    817086    817586        17 LINC00115  0.487936        0.039787      0.616636      9934        35
## ...           ...         ...       ...       ...       ...       ...       ...             ...           ...       ...       ...
## 8418608    159256        chrX 154800772 154801272     36419      F8A1 -0.393617       0.0257158      0.568362     85075       979
## 8418609    159256        chrX 154800772 154801272     36419      F8A1 -0.393617       0.0257158      0.568362     85075       991
## 8418610    159256        chrX 154800772 154801272     36419      F8A1 -0.393617       0.0257158      0.568362     85075      1018
## 8418611    159256        chrX 154800772 154801272     36419      F8A1 -0.393617       0.0257158      0.568362     85075      1020
## 8418612    159256        chrX 154800772 154801272     36419      F8A1 -0.393617       0.0257158      0.568362     85075      1335
##                      tf
##             <character>
## 1                  ATF1
## 2                  ATF2
## 3                  ATF3
## 4                  ATF7
## 5                 BRCA2
## ...                 ...
## 8418608             MNT
## 8418609            NFIA
## 8418610          POLR2A
## 8418611 POLR2AphosphoS5
## 8418612          ZNF766

2.7 Annotate regulon

Regulons can be annotated for further refinement. One such metric is the log fold change of gene expression in specified groups. Before running addLogFC function, we need to log-transform gene expression counts. In this example, we are interested in target genes differentially expressed in Naive CD4+ T cells.

# add 'logcounts' assay
GeneExpressionMatrix <- scuttle::logNormCounts(GeneExpressionMatrix)

regulon.FC <- addLogFC(expMatrix = GeneExpressionMatrix,
                       clusters = GeneExpressionMatrix$cell_type,
                       regulon = pruned.regulon,
                       pval.type = "any",
                       sig_type = "FDR",
                       assay.type = "logcounts",
                       logFC_condition=c('Naive CD4+ T')
)

regulon.FC
## DataFrame with 2754396 rows and 17 columns
##           idxATAC         chr     start       end    idxRNA   target      corr p_val_peak_gene FDR_peak_gene  distance     idxTF
##         <integer> <character> <integer> <integer> <integer>  <array>  <matrix>        <matrix>      <matrix> <integer> <integer>
## 1             135        chr1   1207949   1208449        44  TNFRSF4  0.618308      0.01483209      0.528124      5702       887
## 2             135        chr1   1207949   1208449        43 TNFRSF18  0.686583      0.00836231      0.498760      2269       887
## 3             266        chr1   1686177   1686677        86     NADK  0.682560      0.00867039      0.499675     91791       887
## 4             991        chr1   8947075   8947575       222      CA6  0.581208      0.02000352      0.551146      1207       887
## 5            1103        chr1   9628994   9629494       233 SLC25A33 -0.545647      0.00456395      0.463887     89529       887
## ...           ...         ...       ...       ...       ...      ...       ...             ...           ...       ...       ...
## 2754392    155544       chr22  50524937  50525437     23714     TYMP -0.378229       0.0296565      0.579085      4557      1377
## 2754393    155577       chr22  50582163  50582663     23714     TYMP -0.365791       0.0335423      0.595442     52167      1377
## 2754394    155577       chr22  50582163  50582663     23715    ODF3B -0.337503       0.0436050      0.623854     50086      1377
## 2754395    158053        chrX 106611664 106612164     36005     RADX  0.551794       0.0249989      0.566187         0      1377
## 2754396    158949        chrX 150547614 150548114     36307     MTM1  0.579942       0.0201796      0.551336     20503      1377
##                  tf        pval    stats      qval Naive CD4+ T.vs.rest.FDR Naive CD4+ T.vs.rest.logFC
##         <character>    <matrix> <matrix>  <matrix>                <numeric>                  <numeric>
## 1              ADNP 3.83812e-09 34.70415 0.0286741              1.24716e-40                  -0.158980
## 2              ADNP 1.15983e-04 14.85689 1.0000000              2.07408e-28                  -0.106426
## 3              ADNP 2.14101e-07 26.90142 1.0000000             4.93457e-177                  -0.402145
## 4              ADNP 2.09927e-03  9.46065 1.0000000              1.74410e-35                  -0.265716
## 5              ADNP 1.01188e-03 10.80569 1.0000000              5.97271e-65                  -0.230591
## ...             ...         ...      ...       ...                      ...                        ...
## 2754392        ZXDC 1.93434e-02  5.47020         1              0.00000e+00                 -2.3462477
## 2754393        ZXDC 4.48926e-04 12.31669         1              0.00000e+00                 -2.3462477
## 2754394        ZXDC 6.00511e-03  7.54877         1             9.44166e-179                 -0.3415591
## 2754395        ZXDC 3.72567e-06 21.40105         1              3.21089e-07                  0.0323369
## 2754396        ZXDC 3.33531e-03  8.61432         1              2.88666e-23                  0.1631049

We can also annotate regulons for the presence of motifs at each regulatory element.

regulon.motif <- addMotifScore(regulon = pruned.regulon,
                                peaks = rowRanges(PeakMatrix),
                                species = "human",
                                genome = "hg38")
regulon.motif
## DataFrame with 2754396 rows and 16 columns
##           idxATAC         chr     start       end    idxRNA   target      corr p_val_peak_gene FDR_peak_gene  distance     idxTF
##         <integer> <character> <integer> <integer> <integer>  <array>  <matrix>        <matrix>      <matrix> <integer> <integer>
## 1             135        chr1   1207949   1208449        44  TNFRSF4  0.618308      0.01483209      0.528124      5702       887
## 2             135        chr1   1207949   1208449        43 TNFRSF18  0.686583      0.00836231      0.498760      2269       887
## 3             266        chr1   1686177   1686677        86     NADK  0.682560      0.00867039      0.499675     91791       887
## 4             991        chr1   8947075   8947575       222      CA6  0.581208      0.02000352      0.551146      1207       887
## 5            1103        chr1   9628994   9629494       233 SLC25A33 -0.545647      0.00456395      0.463887     89529       887
## ...           ...         ...       ...       ...       ...      ...       ...             ...           ...       ...       ...
## 2754392    155544       chr22  50524937  50525437     23714     TYMP -0.378229       0.0296565      0.579085      4557      1377
## 2754393    155577       chr22  50582163  50582663     23714     TYMP -0.365791       0.0335423      0.595442     52167      1377
## 2754394    155577       chr22  50582163  50582663     23715    ODF3B -0.337503       0.0436050      0.623854     50086      1377
## 2754395    158053        chrX 106611664 106612164     36005     RADX  0.551794       0.0249989      0.566187         0      1377
## 2754396    158949        chrX 150547614 150548114     36307     MTM1  0.579942       0.0201796      0.551336     20503      1377
##                  tf        pval    stats      qval     motif
##         <character>    <matrix> <matrix>  <matrix> <numeric>
## 1              ADNP 3.83812e-09 34.70415 0.0286741        NA
## 2              ADNP 1.15983e-04 14.85689 1.0000000        NA
## 3              ADNP 2.14101e-07 26.90142 1.0000000        NA
## 4              ADNP 2.09927e-03  9.46065 1.0000000        NA
## 5              ADNP 1.01188e-03 10.80569 1.0000000        NA
## ...             ...         ...      ...       ...       ...
## 2754392        ZXDC 1.93434e-02  5.47020         1        NA
## 2754393        ZXDC 4.48926e-04 12.31669         1        NA
## 2754394        ZXDC 6.00511e-03  7.54877         1        NA
## 2754395        ZXDC 3.72567e-06 21.40105         1        NA
## 2754396        ZXDC 3.33531e-03  8.61432         1        NA

2.8 Calculate TF activity

Finally, activities for a specific TF in each cell are computed by averaging expressions of target genes weighted by the test statistics of choice. \[y=\frac{1}{n}\sum_{i=1}^{n} x_i * weights_i\] where \(y\) is the activity of a TF for a cell,
\(n\) is the total number of targets for a TF,
\(x_i\) is the log count expression of target \(i\) where \(i\) in {1,2,…,n} and
\(weights_i\) is the weight of TF - target \(i\), calculated by running addWeights function.

regulon.w <- addWeights(regulon = pruned.regulon,
                        expMatrix  = GeneExpressionMatrix,
                        exp_assay  = "normalizedCounts",
                        peakMatrix = PeakMatrix,
                        peak_assay = "counts")


score.combine <- calculateActivity(expMatrix = GeneExpressionMatrix,
                                   regulon = regulon.w, 
                                   mode = "weight", 
                                   exp_assay = "normalizedCounts",
                                   normalize = FALSE)

score.combine[1:5,1:5]
## 5 x 5 sparse Matrix of class "dgCMatrix"
##       PBMC_10k#GGTTGCATCCTGGCTT-1 PBMC_10k#GGTTGCGGTAAACAAG-1 PBMC_10k#TGTTCCTCATAAGTTC-1 PBMC_10k#CGACTAAGTAACGGGA-1
## ADNP                   0.11379320                  0.09052355                  0.11310774                  0.07767001
## AEBP2                  0.03488950                  0.06604866                  0.06581493                  0.07258739
## AFF1                   0.09373167                  0.08967789                  0.10807508                  0.06184494
## AFF4                   0.11164763                  0.10782789                  0.13602988                  0.07292949
## AGO1                   0.09017644                  0.07779564                  0.08892206                  0.05903171
##       PBMC_10k#CTGCTCCCAAGGTCCT-1
## ADNP                    0.1201884
## AEBP2                   0.0780261
## AFF1                    0.1136837
## AFF4                    0.1392258
## AGO1                    0.0889592