The servers linked by TFregulomeR

TFregulomeR data compendium is hosted in Singapore and Canada. As an API, TFregulomeR (from v2.0.0) will dynamically access and retrieve the data either from Singapore (default) or Canada server. The functions in TFregulomeR that dynamically link to the servers include dataBrowser(), loadPeaks(), searchMotif(), commonPeaks(), exclusivePeaks(), intersectPeakMatrix(), motifDistrib(), genomeAnnotate() and toTFBSTools(). By default, these functions are linking to Singapore server. If users opt to choose the Canada server, they can use the input parameter server='ca' when using these functions.

Browse TFregulomeR data compendium

TFregulomeR allows users to easily browse TFregulomeR compendium using only one simple function dataBrowser (previously called TFBSBrowser before v1.2.0). Users can search the TFBS according to species, organ, sample type, cell/tissue name, TF name, disease state, and source. A data.frame will be returned upon searching, including the information in TFregulomeR ID (important for downstream analysis), species, organ, sample type, cell/tissue name, description of cell/tissue, disease state, TF name, source, TF source ID, number of peaks and number of peaks with motif. In particular, if no input is given for the function, all records in TFregulomeR compendium will be returned.

In TFregulomeR project, we used MACS2 (Zhang et al., 2008) to call peaks from ChIP-seq datasets and subsequently employed MEME-ChIP (Machanick et al., 2011) to perform de novo motif discovery in each peak set. MACS2 has been proved as one of the best peak callers in terms of sensitivity, precision and F-score metrics in a study encompassing 300 simulated ChIP-seq datasets with different noise levels (Thomas et al., 2017), while MEME-ChIP is one of the most commonly used motif callers and also utilised in both JASPAR 2010 and 2014 versions (Portales-Casamar et al., 2010; Mathelier et al., 2014). Highly and centrally enriched motifs were selected and compared with the existing TF-binding profile databases, such as HOCOMOCO (Kulakovskiy et al., 2018) and JASPAR. 91 highly enriched motifs were not consistent with the TF-binding profile databases. This is presumably due to the fact that in those cell types, TFs are indirectly recruited to genome, and/or that the high presence of some non-targeted motifs repeatedly observed across ChIP-seq datasets, also known as zinger motifs, mask the motif enrichment of the ChIP’ed TF (Hunt et al., 2014). Furthermore, 136 motifs were not recorded for their corresponding TFs in the databases. In order to verify that divergent motifs do not derive from the use of a specific motif discovery algorithm, we used another software, HOMER (Heinz et al., 2010), based on hypergeometric enrichment, to perform an additional de novo motif discovery. Motif results by HOMER were compared with those by MEME-ChIP and their similarity were measured by normalised Pearson correlation coefficient using compare-matrices function in RSAT (Nguyen et al., 2018) with the formula: Ncor = cor * w / w_smaller, where cor is raw Pearson correlation coefficient, w is the alignment width of two matrices from MEME-ChIP and HOMER (the minimum value of w was set as 5), and w_smaller is the width of smaller motifs from MEME-ChIP and HOMER. We found that majority of the PWM matrices generated with MEME-ChIP, was recapitulated with HOMER. We have added the information into the last two columns of dataBrowser output (from v1.2.0).

Figure 1. Similarity of de novo enriched motifs by MEME-ChIP and HOMER. The beeswarm and violin plots show the normalised Pearson correlation coefficient of de novo motifs called by MEME-ChIP and HOMER, and the red dash denotes normalised Pearson correlation coefficient value 0.7.

Figure 1. Similarity of de novo enriched motifs by MEME-ChIP and HOMER. The beeswarm and violin plots show the normalised Pearson correlation coefficient of de novo motifs called by MEME-ChIP and HOMER, and the red dash denotes normalised Pearson correlation coefficient value 0.7.

library(TFregulomeR)
# browse all records in TFregulomeR TFBS compendium
all_record <- dataBrowser() # or TFBSBrowser() before v1.2.0
#> 1468 record(s) found: ...
#> ... covering 415 TF(s)
#> ... from 1 species:
#> ... ...human
#> ... from 29 organ(s):
#> ... ... stem_cell, blood_and_lymph, connective_tissue, colorectum, brain, bone, stomach, prostate, breast, pancreas, skin, kidney, lung, eye, esophagus, heart, muscle, uterus, spleen, cervix, testis, liver, adrenal_gland, neck_and_mouth, pleura, ovary, thymus, fallopian, vagina
#> ... in 3 sample type(s):
#> ... ... primary_cells, cell_line, tissue
#> ... in  414  different cell(s) or tissue(s)
#> ... in 8 type(s) of disease state(s):
#> ... ... normal, tumor, Simpson_Golabi_Behmel_syndrome, progeria, metaplasia, unknown, immortalized, premetastatic
#> ... from the source(s): GTRD, MethMotif

# returned table
head(all_record)
#>                                               ID species             organ   sample_type
#> 1 GTRD-EXP000061_HSA_embryonic-stem-cells_PRDM14   human         stem_cell primary_cells
#> 2          GTRD-EXP000080_HSA_CD4pos-T-cells_YY1   human   blood_and_lymph primary_cells
#> 3                 GTRD-EXP000128_HSA_EWS502_FLI1   human connective_tissue     cell_line
#> 4                  GTRD-EXP000132_HSA_HUVEC_FLI1   human connective_tissue     cell_line
#> 5                GTRD-EXP000140_HSA_LS180_TCF7L2   human        colorectum     cell_line
#> 6                 GTRD-EXP000142_HSA_LS180_CEBPB   human        colorectum     cell_line
#>       cell_tissue_name                      description disease_state     TF source source_ID
#> 1 embryonic-stem-cells             embryonic stem cells        normal PRDM14   GTRD EXP000061
#> 2       CD4pos-T-cells                     CD4+ T-cells        normal    YY1   GTRD EXP000080
#> 3               EWS502                    Ewing sarcoma         tumor   FLI1   GTRD EXP000128
#> 4                HUVEC umbilical vein endothelial cells        normal   FLI1   GTRD EXP000132
#> 5                LS180                     colon cancer         tumor TCF7L2   GTRD EXP000140
#> 6                LS180                     colon cancer         tumor  CEBPB   GTRD EXP000142
#>   peak_num peak_with_motif_num Consistent_with_HOCOMOCO_JASPAR Ncor_between_MEME_ChIP_and_HOMER
#> 1    17482                5656                             YES                               NA
#> 2    18298                2038                             YES                               NA
#> 3    89523               28914                             YES                               NA
#> 4    71437               36126                             YES                               NA
#> 5    16517                2312                             YES                               NA
#> 6   128542               67747                             YES                               NA

# browse TFBSs in blood and lymph
blood_and_lymph_record <- dataBrowser(organ = "blood_and_lymph") # or TFBSBrowser() before v1.2.0
#> 494 record(s) found: ...
#> ... covering 197 TF(s)
#> ... from 1 species:
#> ... ...human
#> ... from 1 organ(s):
#> ... ... blood_and_lymph
#> ... in 3 sample type(s):
#> ... ... primary_cells, cell_line, tissue
#> ... in  129  different cell(s) or tissue(s)
#> ... in 2 type(s) of disease state(s):
#> ... ... normal, tumor
#> ... from the source(s): GTRD, MethMotif

# browse all CEBPB TFBSs
CEBPB_record <- dataBrowser(tf = "CEBPB") # or TFBSBrowser() before v1.2.0
#> 16 record(s) founded: ...
#> ... covering 1 TF(s)
#> ... from 1 species:
#> ... ...human
#> ... from 9 organ(s):
#> ... ... colorectum, uterus, blood_and_lymph, stem_cell, bone, lung, cervix, liver, 
#> breast
#> ... in 2 sample type(s):
#> ... ... cell_line, primary_cells
#> ... in  16  different cell(s) or tissue(s)
#> ... in 2 type(s) of disease state(s):
#> ... ... tumor, normal
#> ... from the source(s): GTRD, MethMotif

Retrieve datasets from TFregulomeR data warehouse

We have designed some useful functions for data retrieval from TFregulomeR compendium. Retrieval of motif matrix and DNA methylation matrix if the source is MethMotif (If the source is GTRD, no DNA methylation information is available) can be achieved using searchMotif. Further, these obtained matrices can be easily saved locally using exportMMPFM, and corresponding (Meth)Motif logo (MethMotif logo, if the source is MethMotif; motif logo, if the source is GTRD) can be simply plotted using plotLogo. Here, we introduced an object of class “MethMotif” using S4 class in order for an easy and intuitive storage, manipulation and conversion (with other packages such as “TFBSTools”) of a MethMotif matrix, which contains a TF motif weight position matrix and its DNA methylation matrix (beta score matrix). What’s more, we allow users to directly load peak regions of a TF of interest (all peaks or peaks with motif only) from TFregulomeR compendium using loadPeaks.

# according to TFBSBrowser results for all CEBPB TFBS query, we select two CEBPB TFBSs 
# from MethMotif and GTRD: MM1_HSA_K562_CEBPB, GTRD-EXP040801_HSA_HL-60_CEBPB.

# loading MethMotif object in "MEME" format. Currently we support "MEME" and "TRANSFAC".
K562_CEBPB <- searchMotif(id = "MM1_HSA_K562_CEBPB", motif_format = "MEME")
#> There are a matched record exported in a MethMotif object.
HL60_CEBPB <- searchMotif(id = "GTRD-EXP040801_HSA_HL-60_CEBPB", motif_format = "MEME")
#> There are a matched record exported in a MethMotif object.
class(K562_CEBPB)
#> [1] "MethMotif"
#> attr(,"package")
#> [1] "TFregulomeR"

After obtaining a MethMotif matrix, user can use plotLogo to plot logo as below (Figure 2). If the TFBS source is MethMotif, then a MethMotif logo will be saved. Two options are available for motif logo, “entropy” and “frequency”, and also different methylation levels (“all”, “methylated” and “unmethylated”) can be opted for methylation bar charts. However, if the TFBS source is GTRD, only a motif logo will be saved due to the unknown DNA methylation within motif.

In the plot, the number of peaks with motif will be printed. It should be NOTED that the motif logo is generated by the all possible TFBSs (initally scanned by FIMO after de novo motif discovery using MEME-ChIP suite with a p-value less than 1e-4) in the peak regions (+/-100bp around peak summits). It’s possbile that one peak region contains more than one significant TFBSs. Hence, the number of TFBSs in the peak regions could be larger than the number of peaks with motif.

For each MethMotif logo, bar plot above motif logo denotes the number of cytosines in CpG context covered by WGBS at each base position in motif and these CpGs are segregated into three groups shown in different colors, namely homogeneously methylated (orange bar, beta score > 90%), heterogeneously methylated (green bar, beta score 10-90%), and homogenously unmethylated (blue bar, beta score < 10%).

plotLogo(K562_CEBPB, logo_type = "entropy", meth_level = "all")
#> Success: a PDF named 'MM1_HSA_K562_CEBPB-logo-entropy.pdf' has been saved!
plotLogo(K562_CEBPB, logo_type = "entropy", meth_level = "methylated")
#> Success: a PDF named 'MM1_HSA_K562_CEBPB-logo-entropy-methylated-only.pdf' has been saved!
plotLogo(K562_CEBPB, logo_type = "entropy", meth_level = "unmethylated")
#> Success: a PDF named 'MM1_HSA_K562_CEBPB-logo-entropy-unmethylated-only.pdf' has been saved!
plotLogo(K562_CEBPB, logo_type = "frequency", meth_level = "all")
#> Success: a PDF named 'MM1_HSA_K562_CEBPB-logo-frequency.pdf' has been saved!
plotLogo(K562_CEBPB, logo_type = "frequency", meth_level = "methylated")
#> Success: a PDF named 'MM1_HSA_K562_CEBPB-logo-frequency-methylated-only.pdf' has been saved!
plotLogo(K562_CEBPB, logo_type = "frequency", meth_level = "unmethylated")
#> Success: a PDF named 'MM1_HSA_K562_CEBPB-logo-frequency-unmethylated-only.pdf' has been saved!

# plot motif logo for GTRD-EXP040801_HSA_HL-60_CEBPB. No DNA methylation states available here
plotLogo(HL60_CEBPB, logo_type = "entropy")
#> Success: a PDF named 'GTRD-EXP040801_HSA_HL-60_CEBPB-logo-entropy.pdf' has been saved!