3 Data preparation

Epiregulon operates on the SingleCellExperiment class. We assume that gene expression, chromatin accessibility and dimension reduction have been obtained by users’ favorite packages prior to the use of Epiregulon. This chapter provides instructions on how to convert gene expression matrix and peak matrix into SingleCellExperiment objects from other formats including ArchR projects, Seurat objects, AnnData and 10x genomics output. The first section provides a quick primer on the components of a SingleCellExperiment object necessary to run Epiregulon. It is recommended for all users to go through it.

3.1 Construct a SingleCellExperiment object

Let’s construct a GeneExpressionMatrix from scratch. First we create the count matrix.

library(SingleCellExperiment)
counts <- matrix(rpois(100000, lambda = 2), ncol=1000, nrow=100)
GeneExpressionMatrix <- SingleCellExperiment(list(counts=counts))
rownames(GeneExpressionMatrix) <- paste("Gene",1:100, sep="-")
colnames(GeneExpressionMatrix) <- paste("Cell",1:1000, sep="-")

It is important (and efficient) to convert the count matrix to dgCMatrix format at the beginning of the workflow.

library(Matrix)
counts(GeneExpressionMatrix) <- as(counts(GeneExpressionMatrix), "dgCMatrix")

Next we will add the cell information to colData

colData(GeneExpressionMatrix) <- DataFrame(Cluster = paste("cluster", sample(1:3,1000, TRUE)))

For the purpose of Epiregulon, it is important to provide the start and end position of the genes so that we can link genes to the peak regions.

seqnames <- paste0("chr", sample(1:2, 100, TRUE))
start <- sample(1:100000, 100, TRUE)
end <- start + sample(100:500, 100, TRUE)
strand <- sample(c("+", "-", "*"), 100, TRUE)

gr <- GRanges(
    seqnames = seqnames,
    ranges = IRanges(start = start, end = end),
    strand = strand
)

Next we provide additional information about the genes. We can add them into the mcols as DataFrame. The GRanges become the rowRanges of the GeneExpressionMatrix.

mcols(gr) <- DataFrame(name = paste("Gene", 1:100, sep="-"),
                       ID = paste0("ID", 1:100))

rowRanges(GeneExpressionMatrix) <- gr
rowRanges(GeneExpressionMatrix)
## GRanges object with 100 ranges and 2 metadata columns:
##         seqnames      ranges strand |        name          ID
##            <Rle>   <IRanges>  <Rle> | <character> <character>
##     [1]     chr1 84756-84878      * |      Gene-1         ID1
##     [2]     chr2 62792-63002      - |      Gene-2         ID2
##     [3]     chr1 70586-71071      * |      Gene-3         ID3
##     [4]     chr1 40856-41031      + |      Gene-4         ID4
##     [5]     chr2 34127-34534      - |      Gene-5         ID5
##     ...      ...         ...    ... .         ...         ...
##    [96]     chr2 91230-91581      * |     Gene-96        ID96
##    [97]     chr1 45345-45716      - |     Gene-97        ID97
##    [98]     chr2 86249-86388      + |     Gene-98        ID98
##    [99]     chr1 51039-51145      - |     Gene-99        ID99
##   [100]     chr1 57524-57636      - |    Gene-100       ID100
##   -------
##   seqinfo: 2 sequences from an unspecified genome; no seqlengths

Additional information about the genes will also appear in rowData

rowData(GeneExpressionMatrix)
## DataFrame with 100 rows and 2 columns
##            name          ID
##     <character> <character>
## 1        Gene-1         ID1
## 2        Gene-2         ID2
## 3        Gene-3         ID3
## 4        Gene-4         ID4
## 5        Gene-5         ID5
## ...         ...         ...
## 96      Gene-96        ID96
## 97      Gene-97        ID97
## 98      Gene-98        ID98
## 99      Gene-99        ID99
## 100    Gene-100       ID100

Finally we add some reduced dimension data which is needed for clustering

reducedDim(GeneExpressionMatrix, "PCA") <- matrix(data=rnorm(20000), nrow=1000, ncol=20)
GeneExpressionMatrix
## class: SingleCellExperiment 
## dim: 100 1000 
## metadata(0):
## assays(1): counts
## rownames: NULL
## rowData names(2): name ID
## colnames: NULL
## colData names(1): Cluster
## reducedDimNames(1): PCA
## mainExpName: NULL
## altExpNames(0):

Repeat the process to create a PeakMatrix

# add counts
counts <- matrix(rpois(1000000, lambda = 1), ncol=1000, nrow=1000)
PeakMatrix <- SingleCellExperiment(list(counts=counts))
rownames(PeakMatrix) <- paste("Peak",1:1000, sep="-")
colnames(PeakMatrix) <- paste("Cell",1:1000, sep="-")

# convert count matrix to dgCMatrix
counts(PeakMatrix) <- as(counts(PeakMatrix), "dgCMatrix")

# add rowRanges
seqnames <- paste0("chr", sample(1:2, 1000, TRUE))
start <- sample(1:100000, 1000, TRUE)
end <- start + sample(100:500, 1000, TRUE)
strand <- sample(c("+", "-", "*"), 1000, TRUE)

gr <- GRanges(
    seqnames = seqnames,
    ranges = IRanges(start = start, end = end),
    strand = strand
)

# add rowData
mcols(gr) <- DataFrame(name = paste("Peak", 1:1000, sep="-"))

rowRanges(PeakMatrix) <- gr

# add reduced dimensionality matrix

reducedDim(PeakMatrix, "ATAC_LSI") <- matrix(data=rnorm(20000), nrow=1000, ncol=20)
PeakMatrix
## class: SingleCellExperiment 
## dim: 1000 1000 
## metadata(0):
## assays(1): counts
## rownames: NULL
## rowData names(1): name
## colnames(1000): Cell-1 Cell-2 ... Cell-999 Cell-1000
## colData names(0):
## reducedDimNames(1): ATAC_LSI
## mainExpName: NULL
## altExpNames(0):

For more information on SingleCellExperiment, please refer to the documentation on bioconductor.

For a real example of a properly formatted SingleCellExperiment object, check out datasets from the scMultiome package. Both the GeneExpressionMatrix and PeakMatrix are combined into a MultiAssayExperiment object.

mae <- scMultiome::reprogramSeq()

Inspect the gene expression matrix

GeneExpressMatrix <- mae[["GeneExpressionMatrix"]]
GeneExpressMatrix
## class: SingleCellExperiment 
## dim: 36438 3903 
## metadata(1): .internal
## assays(2): counts normalizedCounts
## rownames(36438): MIR1302-2HG FAM138A ... IL9R WASIR1
## rowData names(2): name strand.1
## colnames(3903): reprogram#TTAGGAACAAGGTACG-1 reprogram#GAGCGGTCAACCTGGT-1 ... reprogram#GGTTACTAGACACCGC-1
##   reprogram#CGCTATGAGTGAACAG-1
## colData names(23): BlacklistRatio DoubletEnrichment ... ReadsInPeaks FRIP
## reducedDimNames(2): LSI_RNA UMAP_RNA
## mainExpName: NULL
## altExpNames(0):

Inspect the peak matrix

PeakMatrix <- mae[["PeakMatrix"]]
PeakMatrix
## class: SingleCellExperiment 
## dim: 126602 3903 
## metadata(1): .internal
## assays(1): counts
## rownames: NULL
## rowData names(1): idx
## colnames(3903): reprogram#TTAGGAACAAGGTACG-1 reprogram#GAGCGGTCAACCTGGT-1 ... reprogram#GGTTACTAGACACCGC-1
##   reprogram#CGCTATGAGTGAACAG-1
## colData names(23): BlacklistRatio DoubletEnrichment ... ReadsInPeaks FRIP
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):

3.2 Start from an ArchR object

Epiregulon is designed to work seamlessly with ArchR. GeneExpressionMatrix and PeakMatrix can be easily exported from ArchR using ArchR’s build-in functions.

Download a test ArchR project

library(ArchR)
archr.proj <- getTestProject()

Check available matrices in ArchR project

getAvailableMatrices(ArchRProj = archr.proj)
## [1] "GeneIntegrationMatrix" "GeneScoreMatrix"       "MotifMatrix"           "PeakMatrix"            "TileMatrix"

Export the GeneExpressionMatrix and PeakMatrix from the ArchR project.

GeneExpressionMatrix <- getMatrixFromProject(ArchRProj = archr.proj, useMatrix = "GeneIntegrationMatrix")
PeakMatrix <- getMatrixFromProject(ArchRProj = archr.proj, useMatrix = "PeakMatrix")
GeneExpressionMatrix
## class: SummarizedExperiment 
## dim: 2051 127 
## metadata(0):
## assays(1): GeneIntegrationMatrix
## rownames: NULL
## rowData names(6): seqnames start ... name idx
## colnames(127): PBSmall#B.43 PBSmall#T.24 ... PBSmall#B.37 PBSmall#B.4
## colData names(20): BlacklistRatio nDiFrags ... predictedGroup_Un predictedScore_Un
PeakMatrix
## class: RangedSummarizedExperiment 
## dim: 2142 127 
## metadata(0):
## assays(1): PeakMatrix
## rownames: NULL
## rowData names(1): idx
## colnames(127): PBSmall#B.43 PBSmall#T.24 ... PBSmall#B.37 PBSmall#B.4
## colData names(20): BlacklistRatio nDiFrags ... predictedGroup_Un predictedScore_Un

The GeneExpressionMatrix and PeakMatrix exported from ArchR project are in the form of SummarizedExperiment and RangedSummarizedExperiment respectively. We provide a helper function to convert both to SingleCellExperiment class. Furthermore, the genomic location of the genes are transferred from rowData in the RangedSummarizedExperiment to the rowRanges of the SingleCellExperiment.

Please note that GeneExpressionMatrix extracted from ArchR project contain normalized counts (not logged) and for clarity, we rename the assay as “normalizedCounts”

library(epiregulon.archr)
GeneExpressionMatrix <- ArchRMatrix2SCE(rse = GeneExpressionMatrix, rename = "normalizedCounts")
GeneExpressionMatrix
## class: SingleCellExperiment 
## dim: 2051 127 
## metadata(0):
## assays(1): normalizedCounts
## rownames: NULL
## rowData names(2): name idx
## colnames(127): PBSmall#B.43 PBSmall#T.24 ... PBSmall#B.37 PBSmall#B.4
## colData names(20): BlacklistRatio nDiFrags ... predictedGroup_Un predictedScore_Un
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):

We can also transform the counts to logcounts. This will add a new assay and we name it as “logcounts”.

GeneExpressionMatrix <- ArchRMatrix2SCE(rse = GeneExpressionMatrix, transform = TRUE, transform_method = "log", log_name = "logcounts")
GeneExpressionMatrix
## class: SingleCellExperiment 
## dim: 2051 127 
## metadata(0):
## assays(2): counts logcounts
## rownames: NULL
## rowData names(2): name idx
## colnames(127): PBSmall#B.43 PBSmall#T.24 ... PBSmall#B.37 PBSmall#B.4
## colData names(20): BlacklistRatio nDiFrags ... predictedGroup_Un predictedScore_Un
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):

Check that rowRanges have been transferred

rowRanges(GeneExpressionMatrix)
## GRanges object with 2051 ranges and 2 metadata columns:
##          seqnames              ranges strand |        name       idx
##             <Rle>           <IRanges>  <Rle> | <character> <integer>
##      [1]    chr11       126987-139152      - |   LINC01001         1
##      [2]    chr11       193080-194573      + |     SCGB1C1         2
##      [3]    chr11       196761-200258      + |        ODF3         3
##      [4]    chr11       202924-207422      - |       BET1L         4
##      [5]    chr11       208530-215110      + |       RIC8A         5
##      ...      ...                 ...    ... .         ...       ...
##   [2047]     chr5 180551357-180552304      - |       OR2V1       827
##   [2048]     chr5 180581943-180582890      + |       OR2V2       828
##   [2049]     chr5 180620924-180632177      - |       TRIM7       829
##   [2050]     chr5 180650263-180662808      + |      TRIM41       830
##   [2051]     chr5 180683386-180688119      - |      TRIM52       831
##   -------
##   seqinfo: 2 sequences from an unspecified genome; no seqlengths

We next convert the PeakMatrix to a SingleCellExperiment object. The counts exported from ArchR are raw counts and thus we rename the assay as “counts”

PeakMatrix <- ArchRMatrix2SCE(PeakMatrix, rename = "counts")
PeakMatrix
## class: SingleCellExperiment 
## dim: 2142 127 
## metadata(0):
## assays(1): counts
## rownames: NULL
## rowData names(1): idx
## colnames(127): PBSmall#B.43 PBSmall#T.24 ... PBSmall#B.37 PBSmall#B.4
## colData names(20): BlacklistRatio nDiFrags ... predictedGroup_Un predictedScore_Un
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):

Neither SummarizedExperiment object or RangedSummarizedExperiment object contains reduced dimension, so we must extract the reduced dimensionality matrix from the ArchR project and add it to GeneExpressionMatrix and/or PeakMatrix

reducedDim(PeakMatrix, "IterativeLSI") <- getReducedDims(ArchRProj = archr.proj, reducedDims = "IterativeLSI")

Refer to ArchR manual for full documentation.

3.3 Start from a Seurat/Signac object

We download an example multimodel dataset from SeuratData.

devtools::install_github('satijalab/seurat-data')
library(SeuratData)

InstallData("pbmcMultiome")

This is a PBMC dataset consisting of both RNAseq and ATACseq data and we load each modality as a separate Seurat object

library(Seurat)
library(Signac)
library(SeuratData)
pbmc.rna <- LoadData("pbmcMultiome", "pbmc.rna")
pbmc.atac <- LoadData("pbmcMultiome", "pbmc.atac")

We first convert the RNA Seurat object to GeneExpressionMatrix SingleCellExperiment class.

GeneExpressionMatrix <- as.SingleCellExperiment(pbmc.rna, assay="RNA")

Because these genes are missing genomic positions, we must first annotate them with a genomic database, for example ensembl

library(AnnotationHub)
ah <- AnnotationHub()
edb <- ah[["AH98047"]]  #"EnsDb.Hsapiens.v105"
gr <- genes(edb, columns = c("gene_id", "gene_name"))

We retain only the genes that have genomic positions. The genomic positions are necessary to link peak positions to the nearby target genes.

common_genes <- na.omit(intersect(gr$gene_name, rownames(GeneExpressionMatrix)))
GeneExpressionMatrix <- GeneExpressionMatrix[common_genes,]
rowRanges(GeneExpressionMatrix) <- gr[match(common_genes, gr$gene_name)]
rowRanges(GeneExpressionMatrix)
## GRanges object with 23644 ranges and 2 metadata columns:
##                   seqnames            ranges strand |         gene_id    gene_name
##                      <Rle>         <IRanges>  <Rle> |     <character>  <character>
##   ENSG00000243485        1       29554-31109      + | ENSG00000243485  MIR1302-2HG
##   ENSG00000237613        1       34554-36081      - | ENSG00000237613      FAM138A
##   ENSG00000186092        1       65419-71585      + | ENSG00000186092        OR4F5
##   ENSG00000284733        1     450740-451678      - | ENSG00000284733       OR4F29
##   ENSG00000284662        1     685716-686654      - | ENSG00000284662       OR4F16
##               ...      ...               ...    ... .             ...          ...
##   ENSG00000228296        Y 25063083-25099892      - | ENSG00000228296       TTTY4C
##   ENSG00000223641        Y 25182277-25213389      - | ENSG00000223641      TTTY17C
##   ENSG00000228786        Y 25378300-25394719      - | ENSG00000228786 LINC00266-4P
##   ENSG00000172288        Y 25622162-25624902      + | ENSG00000172288         CDY1
##   ENSG00000231141        Y 25728490-25733388      + | ENSG00000231141        TTTY3
##   -------
##   seqinfo: 456 sequences (1 circular) from GRCh38 genome

We then convert ATAC matrix to PeakMatrix. After conversion to SingleCellExperiment, the peak positions appear as rownames and must be converted to GRanges.

PeakMatrix <- as.SingleCellExperiment(pbmc.atac, assay="ATAC")

peak_position <- strsplit(rownames(PeakMatrix), split = "-")
gr <- GRanges(
    seqnames = sapply(peak_position,"[",1),
    ranges = IRanges(start = as.numeric(sapply(peak_position,"[",2)), 
                     end = as.numeric(sapply(peak_position,"[",3)))
)
rowRanges(PeakMatrix) <- gr
PeakMatrix
## class: SingleCellExperiment 
## dim: 108377 11909 
## metadata(0):
## assays(2): counts logcounts
## rownames: NULL
## rowData names(0):
## colnames(11909): AAACAGCCAAGGAATC-1 AAACAGCCAATCCCTT-1 ... TTTGTTGGTTGGTTAG-1 TTTGTTGGTTTGCAGA-1
## colData names(5): orig.ident nCount_ATAC nFeature_ATAC seurat_annotations ident
## reducedDimNames(0):
## mainExpName: ATAC
## altExpNames(0):

For more information, refer to Signac tutorial

3.4 Start from AnnData

We download an example PBMC anndata dataset from scglue.

We first import the GeneExpressionMatrix.

library(zellkonverter)
library(GenomicRanges)
library(SingleCellExperiment)
url <- "http://download.gao-lab.org/GLUE/dataset/10x-Multiome-Pbmc10k-RNA.h5ad"
destfile <- tempfile(fileext = ".h5ad")

# Download the file
download.file(url, destfile, mode = "wb")
GeneExpressionMatrix <- readH5AD(destfile, version="0.10.6")

The field “chrom” “chromStart” “chromEnd” correspond to genomic positions of the genes.

rowData(GeneExpressionMatrix)
## DataFrame with 29095 rows and 26 columns
##                   gene_ids   feature_types   genome      chrom chromStart  chromEnd            name    score   strand thickStart
##                <character>        <factor> <factor>   <factor>  <numeric> <numeric>     <character> <factor> <factor>   <factor>
## AL627309.1 ENSG00000238009 Gene Expression   GRCh38       chr1      89294    133723 ENSG00000238009        .        -          .
## AL627309.5 ENSG00000241860 Gene Expression   GRCh38       chr1     141473    173862 ENSG00000241860        .        -          .
## AL627309.4 ENSG00000241599 Gene Expression   GRCh38       chr1     160445    161525 ENSG00000241599        .        +          .
## AP006222.2 ENSG00000286448 Gene Expression   GRCh38       chr1     266854    268655 ENSG00000286448        .        +          .
## AL669831.2 ENSG00000229905 Gene Expression   GRCh38       chr1     760910    761989 ENSG00000229905        .        +          .
## ...                    ...             ...      ...        ...        ...       ...             ...      ...      ...        ...
## AC004556.3 ENSG00000276345 Gene Expression   GRCh38 KI270721.1       2584     11802 ENSG00000276345        .        +          .
## AC233755.2 ENSG00000277856 Gene Expression   GRCh38 KI270726.1      26240     26534 ENSG00000277856        .        +          .
## AC233755.1 ENSG00000275063 Gene Expression   GRCh38 KI270726.1      41443     41876 ENSG00000275063        .        +          .
## AC007325.1 ENSG00000276017 Gene Expression   GRCh38 KI270734.1      72410     74814 ENSG00000276017        .        +          .
## AC007325.4 ENSG00000278817 Gene Expression   GRCh38 KI270734.1     131493    137392 ENSG00000278817        .        +          .
##            thickEnd  itemRgb blockCount blockSizes blockStarts      gene_type  gene_name  hgnc_id          havana_gene
##            <factor> <factor>   <factor>   <factor>    <factor>       <factor>   <factor> <factor>             <factor>
## AL627309.1        .        .          .          .           .         lncRNA AL627309.1       NA OTTHUMG00000001096.2
## AL627309.5        .        .          .          .           .         lncRNA AL627309.5       NA OTTHUMG00000002480.4
## AL627309.4        .        .          .          .           .         lncRNA AL627309.4       NA OTTHUMG00000002525.1
## AP006222.2        .        .          .          .           .         lncRNA AP006222.2       NA OTTHUMG00000194680.1
## AL669831.2        .        .          .          .           .         lncRNA AL669831.2       NA OTTHUMG00000002408.1
## ...             ...      ...        ...        ...         ...            ...        ...      ...                  ...
## AC004556.3        .        .          .          .           . protein_coding AC004556.3       NA                   NA
## AC233755.2        .        .          .          .           . protein_coding AC233755.2       NA                   NA
## AC233755.1        .        .          .          .           . protein_coding AC233755.1       NA                   NA
## AC007325.1        .        .          .          .           . protein_coding AC007325.1       NA                   NA
## AC007325.4        .        .          .          .           . protein_coding AC007325.4       NA                   NA
##                          tag  n_counts highly_variable highly_variable_rank       means   variances variances_norm
##                     <factor> <numeric>       <logical>            <numeric>   <numeric>   <numeric>      <numeric>
## AL627309.1 overlapping_locus        70           FALSE                  NaN 0.007268196 0.008462225       0.971895
## AL627309.5 ncRNA_host              442           FALSE                  NaN 0.045893469 0.050853072       0.888672
## AL627309.4 NA                       44           FALSE                  NaN 0.004568581 0.004755865       0.891707
## AP006222.2 NA                        1           FALSE                  NaN 0.000103831 0.000103831       0.999904
## AL669831.2 NA                       10           FALSE                  NaN 0.001038314 0.001037343       0.933916
## ...                      ...       ...             ...                  ...         ...         ...            ...
## AC004556.3                NA       320           FALSE                  NaN 0.033226041 0.035448356       0.856870
## AC233755.2                NA         1           FALSE                  NaN 0.000103831 0.000103831       0.999904
## AC233755.1                NA         1           FALSE                  NaN 0.000103831 0.000103831       0.999904
## AC007325.1                NA         3           FALSE                  NaN 0.000311494 0.000311429       0.975766
## AC007325.4                NA        43           FALSE                  NaN 0.004464749 0.004445277       0.854162

They must be renamed to “seqnames” “start” and “end” before conversion to GRanges

index_to_rename <- match(c("chrom", "chromStart",  "chromEnd"), colnames(rowData(GeneExpressionMatrix)))
colnames(rowData(GeneExpressionMatrix))[index_to_rename] <- c("seqnames", "start",   "end")
rowRanges(GeneExpressionMatrix) <- GRanges(rowData(GeneExpressionMatrix))
rowRanges(GeneExpressionMatrix)
## GRanges object with 29095 ranges and 22 metadata columns:
##                seqnames        ranges strand |        gene_ids   feature_types   genome            name    score thickStart
##                   <Rle>     <IRanges>  <Rle> |     <character>        <factor> <factor>     <character> <factor>   <factor>
##   AL627309.1       chr1  89294-133723      - | ENSG00000238009 Gene Expression   GRCh38 ENSG00000238009        .          .
##   AL627309.5       chr1 141473-173862      - | ENSG00000241860 Gene Expression   GRCh38 ENSG00000241860        .          .
##   AL627309.4       chr1 160445-161525      + | ENSG00000241599 Gene Expression   GRCh38 ENSG00000241599        .          .
##   AP006222.2       chr1 266854-268655      + | ENSG00000286448 Gene Expression   GRCh38 ENSG00000286448        .          .
##   AL669831.2       chr1 760910-761989      + | ENSG00000229905 Gene Expression   GRCh38 ENSG00000229905        .          .
##          ...        ...           ...    ... .             ...             ...      ...             ...      ...        ...
##   AC004556.3 KI270721.1    2584-11802      + | ENSG00000276345 Gene Expression   GRCh38 ENSG00000276345        .          .
##   AC233755.2 KI270726.1   26240-26534      + | ENSG00000277856 Gene Expression   GRCh38 ENSG00000277856        .          .
##   AC233755.1 KI270726.1   41443-41876      + | ENSG00000275063 Gene Expression   GRCh38 ENSG00000275063        .          .
##   AC007325.1 KI270734.1   72410-74814      + | ENSG00000276017 Gene Expression   GRCh38 ENSG00000276017        .          .
##   AC007325.4 KI270734.1 131493-137392      + | ENSG00000278817 Gene Expression   GRCh38 ENSG00000278817        .          .
##              thickEnd  itemRgb blockCount blockSizes blockStarts      gene_type  gene_name  hgnc_id          havana_gene
##              <factor> <factor>   <factor>   <factor>    <factor>       <factor>   <factor> <factor>             <factor>
##   AL627309.1        .        .          .          .           .         lncRNA AL627309.1       NA OTTHUMG00000001096.2
##   AL627309.5        .        .          .          .           .         lncRNA AL627309.5       NA OTTHUMG00000002480.4
##   AL627309.4        .        .          .          .           .         lncRNA AL627309.4       NA OTTHUMG00000002525.1
##   AP006222.2        .        .          .          .           .         lncRNA AP006222.2       NA OTTHUMG00000194680.1
##   AL669831.2        .        .          .          .           .         lncRNA AL669831.2       NA OTTHUMG00000002408.1
##          ...      ...      ...        ...        ...         ...            ...        ...      ...                  ...
##   AC004556.3        .        .          .          .           . protein_coding AC004556.3       NA                   NA
##   AC233755.2        .        .          .          .           . protein_coding AC233755.2       NA                   NA
##   AC233755.1        .        .          .          .           . protein_coding AC233755.1       NA                   NA
##   AC007325.1        .        .          .          .           . protein_coding AC007325.1       NA                   NA
##   AC007325.4        .        .          .          .           . protein_coding AC007325.4       NA                   NA
##                            tag  n_counts highly_variable highly_variable_rank       means   variances variances_norm
##                       <factor> <numeric>       <logical>            <numeric>   <numeric>   <numeric>      <numeric>
##   AL627309.1 overlapping_locus        70           FALSE                  NaN 0.007268196 0.008462225       0.971895
##   AL627309.5 ncRNA_host              442           FALSE                  NaN 0.045893469 0.050853072       0.888672
##   AL627309.4 NA                       44           FALSE                  NaN 0.004568581 0.004755865       0.891707
##   AP006222.2 NA                        1           FALSE                  NaN 0.000103831 0.000103831       0.999904
##   AL669831.2 NA                       10           FALSE                  NaN 0.001038314 0.001037343       0.933916
##          ...               ...       ...             ...                  ...         ...         ...            ...
##   AC004556.3                NA       320           FALSE                  NaN 0.033226041 0.035448356       0.856870
##   AC233755.2                NA         1           FALSE                  NaN 0.000103831 0.000103831       0.999904
##   AC233755.1                NA         1           FALSE                  NaN 0.000103831 0.000103831       0.999904
##   AC007325.1                NA         3           FALSE                  NaN 0.000311494 0.000311429       0.975766
##   AC007325.4                NA        43           FALSE                  NaN 0.004464749 0.004445277       0.854162
##   -------
##   seqinfo: 34 sequences from an unspecified genome; no seqlengths

We next import the PeakMatrix. The rownames are already in the format of seqnames:start-end, so rowRanges can be extracted directly from rowData.

library(zellkonverter)
url <- "http://download.gao-lab.org/GLUE/dataset/10x-Multiome-Pbmc10k-ATAC.h5ad"
destfile <- tempfile(fileext = ".h5ad")

# Download the file
download.file(url, destfile, mode = "wb")

PeakMatrix <- readH5AD(destfile, version="0.10.6")
rowRanges(PeakMatrix) <- GRanges(rowData(PeakMatrix))
rowRanges(PeakMatrix)
## GRanges object with 107194 ranges and 3 metadata columns:
##                            seqnames        ranges strand | feature_types   genome  n_counts
##                               <Rle>     <IRanges>  <Rle> |      <factor> <factor> <numeric>
##       chr1:816881-817647       chr1 816881-817647      * |         Peaks   GRCh38      1025
##       chr1:819912-823500       chr1 819912-823500      * |         Peaks   GRCh38      1384
##       chr1:825827-825889       chr1 825827-825889      * |         Peaks   GRCh38        20
##       chr1:826612-827979       chr1 826612-827979      * |         Peaks   GRCh38      4555
##       chr1:841243-843059       chr1 841243-843059      * |         Peaks   GRCh38       555
##                      ...        ...           ...    ... .           ...      ...       ...
##   KI270713.1:20444-22615 KI270713.1   20444-22615      * |         Peaks   GRCh38     12640
##   KI270713.1:27118-28927 KI270713.1   27118-28927      * |         Peaks   GRCh38       533
##   KI270713.1:29485-30706 KI270713.1   29485-30706      * |         Peaks   GRCh38       622
##   KI270713.1:31511-32072 KI270713.1   31511-32072      * |         Peaks   GRCh38       207
##   KI270713.1:37129-37638 KI270713.1   37129-37638      * |         Peaks   GRCh38       388
##   -------
##   seqinfo: 32 sequences from an unspecified genome; no seqlengths

This example SingleCellExperiment dataset is missing the dimensionality reduction information. Users should have been this precalculated. Refer to the section on how to import from csv files, or refer to OSCA book on how to perform dimensionality reduction.

3.5 Start from 10x data formats

3.5.1 Read from .h5 file

We first download the publicly available PBMC data from 10x Genomics

url <- "https://cf.10xgenomics.com/samples/cell-arc/1.0.0/pbmc_granulocyte_sorted_10k/pbmc_granulocyte_sorted_10k_filtered_feature_bc_matrix.h5"
destfile <- tempfile(fileext = ".h5")  

# Download the file
download.file(url, destfile, mode = "wb")

This .h5 file contains both genes and peak regions as features, so we read in all the features into a single SingleCellExperiment, and we can specify “Gene Expression” as the main experiment and “Peaks” as the alternative experiment. We then split this single object into 2 separate SingleCellExperiment objects for consistency with other sections.

library(DropletUtils)
sce10x <- read10xCounts(destfile)

sce10x <- splitAltExps(sce10x, f=rowData(sce10x)$Type, ref="Gene Expression")

peakMatrix <- altExp(sce10x)
rowRanges(peakMatrix) <- GRanges(rownames(peakMatrix))

GeneExpressionMatrix <- removeAltExps(sce10x)

3.5.2 Read directly from a 10x directory

We can also create a SingleCellExperiment object from a directory containing “matrix.mtx.gz”, “barcodes.tsv.gz” and “features.tsv.gz”

example(write10xCounts)

list.files(tmpdir)

sce10x <- read10xCounts(tmpdir)

Refer to the section on Constructing a SingleCellExperiment object on how to construct SingleCellExperiment, add colData, rowData and reduced dimensions.

3.6 Read directly from CSV

If the count matrices are in the form of csv files, read in the count matrix as a sparse matrix and convert it to a dgCMatrix.

library(SparseArray)
counts <- readSparseCSV("counts.csv")
counts <- as(counts, "dgCMatrix")

Then read in the genomic positions as data.frame and convert it to a GRanges.

peak_position <- read.csv("peaks.csv")
peak_position <- data.frame(peak_position)
gr <- GenomicRanges::makeGRangesFromDataFrame(peak_position)

Refer to the section on Constructing a SingleCellExperiment object on how to construct SingleCellExperiment, add colData, rowData and reduced dimensions.

3.7 Important points

Finally, we would like to emphasize again the importance of converting all count matrices to dgCMatrix for speed and compatibility. Both gene expression and peak matrices require rowRanges to indicate genomic positions of target gene position and peak position.