4 Construction of the gene regulatory network

This chapter outlines the basic steps for constructing a gene regulatory network, which takes the form of a tripartite graph: TF → RE → TG. In brief, TF occupancy data—derived from either ChIP-seq or motif information—are used to map transcription factors onto genomic regions (here, ATAC-seq peaks). These regions are then linked to their target genes by leveraging the assumption that chromatin accessibility and gene expression are highly correlated.

4.1 Download and format dataset

We will make use of the PBMC data set from 10x Genomics. The MAE object was uploaded to scMultiome for full reproducibility.

# Download the example dataset
mae <- scMultiome::PBMC_10x()

# Load peak matrix
PeakMatrix <- mae[["PeakMatrix"]]

# Load expression matrix
GeneExpressionMatrix <- mae[["GeneExpressionMatrix"]]

Visualize the data.

scater::plotReducedDim(GeneExpressionMatrix, 
                       dimred = "UMAP_RNA", 
                       text_by = "cell_type", 
                       colour_by = "cell_type",
                       point_size = 0.3,
                       point_alpha = 0.3)

4.2 Retrieve TF occupancy data

We recommend using ChIP-seq data—rather than motifs—as the proxy for TF occupancy. The gold standard in the field is to use sample-matched ChIP-seq for TF binding. To support this, we provide both sample-specific and tissue-specific ChIP-seq peak sets. However, sample-matched or tissue-matched ChIP-seq data are not available for many factors. To address this, we include a set of pan-tissue ChIP-seq peaks, generated by merging data across multiple samples and lineages; this serves as the default TF occupancy source. Benchmarking in our manuscript demonstrates that pan-tissue ChIP-seq still outperforms motif-based annotations for TF activity inference and target-gene recovery. In addition, ChIP-seq offers a key advantage over motifs in enabling GRN inference for transcriptional co-regulators that lack known motifs.

One limitation of ChIP-seq–based occupancy is that its accuracy depends on the underlying experimental quality. We applied several quality-control filters to the ChIP-seq data included (see Methods in the manuscript), but some factors still exhibit very large numbers of binding sites (>100k). This may reflect true biology or simply false positives. In such cases, filtering TF occupancy based on the intersection of ChIP-seq peaks and motif sites (see next chapter) can be helpful.

4.2.1 ChIP-seq data

We retrieve a GRangesList object containing the binding sites of all the transcription factors and co-regulators. These binding sites are derived from bulk ChIP-seq data in the ChIP-Atlas and ENCODE databases, and merged across multiple cell lines or tissues. Currently, human genomes hg19 and hg38 and mouse mm10 are supported. All ChIP-seq and motif binding sites are hosted on ExperimentHub. The initial download may take some time, but the data are cached for subsequent use.

library(epiregulon)
grl <- getTFMotifInfo(genome = "hg38")
grl
## GRangesList object of length 1558:
## $AEBP2
## GRanges object with 2700 ranges and 0 metadata columns:
##          seqnames            ranges strand
##             <Rle>         <IRanges>  <Rle>
##      [1]     chr1        9792-10446      *
##      [2]     chr1     942105-942400      *
##      [3]     chr1     984486-984781      *
##      [4]     chr1   3068932-3069282      *
##      [5]     chr1   3069411-3069950      *
##      ...      ...               ...    ...
##   [2696]     chrY   8465261-8465730      *
##   [2697]     chrY 11721744-11722260      *
##   [2698]     chrY 11747448-11747964      *
##   [2699]     chrY 19302661-19303134      *
##   [2700]     chrY 19985662-19985982      *
##   -------
##   seqinfo: 25 sequences from an unspecified genome; no seqlengths
## 
## ...
## <1557 more elements>

For both ChIP-Atlas and ENCODE data, we also provide sample-specific TF binding sites. They are retrieved as a list of GrangesList objects, by specifying atlas.sample or encode.sample in the source argument.

grl_list_sample_atlas <- getTFMotifInfo(genome = "hg38", source = "atlas.sample")
head(names(grl_list_sample_atlas))
## [1] "ML-2"        "hTERT RPE-1" "DLD-1"       "293"         "HeLa"        "MCF-7"

Then we need to access a specific list element by providing the sample of interest. Here we choose "Peripheral blood mononuclear cells".

grl_list_sample_atlas[["Peripheral blood mononuclear cells"]]
## GRangesList object of length 2:
## $EZH2
## GRanges object with 14399 ranges and 0 metadata columns:
##                         seqnames        ranges strand
##                            <Rle>     <IRanges>  <Rle>
##       [1]                   chr1    9949-10539      *
##       [2]                   chr1   28949-29442      *
##       [3]                   chr1 180750-181044      *
##       [4]                   chr1 199296-200054      *
##       [5]                   chr1 629280-630005      *
##       ...                    ...           ...    ...
##   [14395]       chrUn_KI270435v1   91984-92197      *
##   [14396]       chrUn_KI270512v1       252-395      *
##   [14397] chr15_KI270727v1_ran..   80081-80345      *
##   [14398] chr1_KI270706v1_random 165426-165601      *
##   [14399]       chrUn_KI270743v1 152955-153105      *
##   -------
##   seqinfo: 89 sequences from an unspecified genome; no seqlengths
## 
## $HIF1A
## GRanges object with 2199 ranges and 0 metadata columns:
##          seqnames            ranges strand
##             <Rle>         <IRanges>  <Rle>
##      [1]     chr1       10009-10248      *
##      [2]     chr1       10359-10460      *
##      [3]     chr1     180748-180991      *
##      [4]     chr1     631239-631293      *
##      [5]     chr1     631989-632060      *
##      ...      ...               ...    ...
##   [2195]     chrY 56826777-56826855      *
##   [2196]     chrY 56847169-56847217      *
##   [2197]     chrY 56850383-56850460      *
##   [2198]     chrY 56850790-56850873      *
##   [2199]     chrY 56851105-56851159      *
##   -------
##   seqinfo: 89 sequences from an unspecified genome; no seqlengths

Alternatively, we might retrieve chip-seq data for "K562" cells present in the ENCODE database.

grl_list_sample_encode <- getTFMotifInfo(genome = "hg38", source = "encode.sample")
grl_list_sample_encode[["K562"]]
## GRangesList object of length 468:
## $PYGO2
## GRanges object with 6664 ranges and 0 metadata columns:
##          seqnames              ranges strand
##             <Rle>           <IRanges>  <Rle>
##      [1]     chr3     3110104-3110423      *
##      [2]     chr3     4225236-4225599      *
##      [3]     chr3     4350776-4351095      *
##      [4]     chr3     4467164-4467483      *
##      [5]     chr3     4533816-4534135      *
##      ...      ...                 ...    ...
##   [6660]    chr13   99053615-99053934      *
##   [6661]    chr13 109707096-109707415      *
##   [6662]    chr13 110874168-110874531      *
##   [6663]    chr13 110963556-110963834      *
##   [6664]    chr13 112711615-112711978      *
##   -------
##   seqinfo: 54 sequences from an unspecified genome; no seqlengths
## 
## ...
## <467 more elements>

The tissue-specific chip-seq data are also available. They were compiled from the ChIP-Atlas database.

grl_list_tissue <- getTFMotifInfo(genome = "hg38", source = "atlas.tissue")
grl_list_tissue[["Blood"]]
## GRangesList object of length 770:
## $`MLL-AF6`
## GRanges object with 477 ranges and 0 metadata columns:
##         seqnames              ranges strand
##            <Rle>           <IRanges>  <Rle>
##     [1]     chr1     4303013-4303150      *
##     [2]     chr1   71928076-71928393      *
##     [3]     chr1   87300735-87300856      *
##     [4]     chr1   91387234-91387457      *
##     [5]     chr1 108771977-108772126      *
##     ...      ...                 ...    ...
##   [473]     chrY   56838883-56839022      *
##   [474]     chrY   56843654-56843767      *
##   [475]     chrY   56848774-56848878      *
##   [476]     chrY   56850362-56850481      *
##   [477]     chrY   56851423-56851611      *
##   -------
##   seqinfo: 183 sequences from an unspecified genome; no seqlengths
## 
## ...
## <769 more elements>

The advantage of using tissue- or sample-specific ChIP-seq data is that they accurately reflect TF binding sites that are characteristic of a given cell type or cell line. The downside is the small number of factors and peaks due to the limited number of experiments used to compile the data.

# calculate the total number of factors contained in each grl
# pan-tissue
length(grl)
## [1] 1558
# blood tissue
length(grl_list_tissue[["Blood"]])
## [1] 770
# Peripheral blood mononuclear cells sample
length(grl_list_sample_atlas[["Peripheral blood mononuclear cells"]])
## [1] 2
# K562 cell line sample
length(grl_list_sample_encode[["K562"]])
## [1] 468
# calculate the total number of regions contained in each grl
# pan-tissue
format(sum(unlist(lapply(grl, length))), big.mark=",")
## [1] "54,501,169"
# blood tissue
format(sum(unlist(lapply(grl_list_tissue[["Blood"]], length))), big.mark=",")
## [1] "18,684,867"
# Peripheral blood mononuclear cells sample
format(sum(unlist(lapply(grl_list_sample_atlas[["Peripheral blood mononuclear cells"]], length))), big.mark=",")
## [1] "16,598"
# K562 cell line sample
format(sum(unlist(lapply(grl_list_sample_encode[["K562"]], length))), big.mark=",")
## [1] "6,352,422"

Lastly, users can always supply their custom ChIP-seq data and convert them into GrangesList.

4.2.2 Motif information

While ChIP-seq data provide experimentally determined TF biding sites, some users might prefer to start from motif sequences as positions of TF occupancy. If motif is chosen for mode, getTFMotifInfo retrieves the motif information from the CIS_BP database and scans the peak regions for presence of motifs.

grl_motif <- getTFMotifInfo(genome = "hg38", mode = "motif", peaks = rowRanges(PeakMatrix))

4.4 Add TF occupancy to peaks

The next step is to add TF binding information to peak regions.

overlap <- addTFMotifInfo(grl = grl, p2g = p2g, peakMatrix = PeakMatrix)
head(overlap)
##     idxATAC idxTF     tf
## 711       7    12 ARID5B
## 712       7    91   ELF3
## 713       7   150   HEY1
## 714       7   159  HNF4A
## 715       7   160  HNF4G
## 716       7   217    MAX

4.5 Generate regulons

A DataFrame, representing the inferred regulons, is then generated. The DataFrame object consists of the following columns:

  1. idxATAC - index of the peak in the peak matrix
  2. chr - chromosome number
  3. start - start position of the peak
  4. end - end position of the peak
  5. idxRNA - index in the gene expression matrix corresponding to the target gene
  6. target - name of the target gene
  7. corr - correlation between target gene expression and the chromatin accessibility at the peak. If cluster labels are provided, this field is a matrix with columns names corresponding to correlation across all cells and for each of the clusters.
  8. p_val_peak_gene - empirical p-value of the correlation between peak and target gene. If cluster labels are provided, this field is a matrix with columns names corresponding to p-value for all cells and for each of the clusters.
  9. FDR_peak_gene - false discovery rates calculated based on the empirical p-value of the correlations between peaks and target genes. If cluster labels are provided, this field is a matrix with columns names corresponding to FDR for all cells and for each of the clusters.
  10. distance - distance between the transcription start site of the target gene and the middle of the peak
  11. idxTF - index in the gene expression matrix corresponding to the transcription factor
  12. tf - name of the transcription factor
regulon <- getRegulon(p2g, overlap, aggregate=FALSE)
regulon
## DataFrame with 9563289 rows and 12 columns
##           idxATAC         chr     start       end    idxRNA  target     corr p_val_peak_gene FDR_peak_gene  distance     idxTF
##         <integer> <character> <integer> <integer> <integer> <array> <matrix>        <matrix>      <matrix> <integer> <integer>
## 1               7        chr1    860832    861332        16  FAM87B 0.854753      0.00984755      0.476024     43461        12
## 2               7        chr1    860832    861332        16  FAM87B 0.854753      0.00984755      0.476024     43461        91
## 3               7        chr1    860832    861332        16  FAM87B 0.854753      0.00984755      0.476024     43461       150
## 4               7        chr1    860832    861332        16  FAM87B 0.854753      0.00984755      0.476024     43461       159
## 5               7        chr1    860832    861332        16  FAM87B 0.854753      0.00984755      0.476024     43461       160
## ...           ...         ...       ...       ...       ...     ...      ...             ...           ...       ...       ...
## 9563285    159285        chrX 155611289 155611789     36434   TMLHE 0.852475       0.0100948      0.476024      1086       348
## 9563286    159285        chrX 155611289 155611789     36434   TMLHE 0.852475       0.0100948      0.476024      1086       360
## 9563287    159285        chrX 155611289 155611789     36434   TMLHE 0.852475       0.0100948      0.476024      1086       432
## 9563288    159285        chrX 155611289 155611789     36434   TMLHE 0.852475       0.0100948      0.476024      1086       630
## 9563289    159285        chrX 155611289 155611789     36434   TMLHE 0.852475       0.0100948      0.476024      1086       785
##                  tf
##         <character>
## 1            ARID5B
## 2              ELF3
## 3              HEY1
## 4             HNF4A
## 5             HNF4G
## ...             ...
## 9563285     SMARCC1
## 9563286        SPI1
## 9563287      ZBTB7A
## 9563288         GFP
## 9563289       MYH11