Abstract
Transcription factors (TFs) are sequence-specific binding proteins, fine-tuning the spatiotemporal gene expression. Since the genomic occupancy of TF is highly dynamic, it is crucial to study TF binding sites (TFBS) in a cell-specific and in vivo context. Here, we introduce TFregulomeR, an R-package linked to a large timely-updated compendium of cistrome and methylome datasets, implemented with functionalities that facilitate the manipulation and analysis of TFBS and methylome meta-data. In particular, TFregulomeR permits the characterisation of TF binding partners and cell-specific TFBSs, along with the study of TF’s functions in the context of different partners’ combinations and DNA methylation levels. TFregulomeR package version: 2.0.0TFregulomeR data compendium is hosted in Singapore and Canada. As an API, TFregulomeR (from v2.0.0) will dynamically access and retrieve the data either from Singapore (default) or Canada server. The functions in TFregulomeR that dynamically link to the servers include dataBrowser()
, loadPeaks()
, searchMotif()
, commonPeaks()
, exclusivePeaks()
, intersectPeakMatrix()
, motifDistrib()
, genomeAnnotate()
and toTFBSTools()
. By default, these functions are linking to Singapore server. If users opt to choose the Canada server, they can use the input parameter server='ca'
when using these functions.
TFregulomeR allows users to easily browse TFregulomeR compendium using only one simple function dataBrowser
(previously called TFBSBrowser
before v1.2.0). Users can search the TFBS according to species, organ, sample type, cell/tissue name, TF name, disease state, and source. A data.frame will be returned upon searching, including the information in TFregulomeR ID (important for downstream analysis), species, organ, sample type, cell/tissue name, description of cell/tissue, disease state, TF name, source, TF source ID, number of peaks and number of peaks with motif. In particular, if no input is given for the function, all records in TFregulomeR compendium will be returned.
In TFregulomeR project, we used MACS2 (Zhang et al., 2008) to call peaks from ChIP-seq datasets and subsequently employed MEME-ChIP (Machanick et al., 2011) to perform de novo motif discovery in each peak set. MACS2 has been proved as one of the best peak callers in terms of sensitivity, precision and F-score metrics in a study encompassing 300 simulated ChIP-seq datasets with different noise levels (Thomas et al., 2017), while MEME-ChIP is one of the most commonly used motif callers and also utilised in both JASPAR 2010 and 2014 versions (Portales-Casamar et al., 2010; Mathelier et al., 2014). Highly and centrally enriched motifs were selected and compared with the existing TF-binding profile databases, such as HOCOMOCO (Kulakovskiy et al., 2018) and JASPAR. 91 highly enriched motifs were not consistent with the TF-binding profile databases. This is presumably due to the fact that in those cell types, TFs are indirectly recruited to genome, and/or that the high presence of some non-targeted motifs repeatedly observed across ChIP-seq datasets, also known as zinger motifs, mask the motif enrichment of the ChIP’ed TF (Hunt et al., 2014). Furthermore, 136 motifs were not recorded for their corresponding TFs in the databases. In order to verify that divergent motifs do not derive from the use of a specific motif discovery algorithm, we used another software, HOMER (Heinz et al., 2010), based on hypergeometric enrichment, to perform an additional de novo motif discovery. Motif results by HOMER were compared with those by MEME-ChIP and their similarity were measured by normalised Pearson correlation coefficient using compare-matrices function in RSAT (Nguyen et al., 2018) with the formula: Ncor = cor * w / w_smaller, where cor is raw Pearson correlation coefficient, w is the alignment width of two matrices from MEME-ChIP and HOMER (the minimum value of w was set as 5), and w_smaller is the width of smaller motifs from MEME-ChIP and HOMER. We found that majority of the PWM matrices generated with MEME-ChIP, was recapitulated with HOMER. We have added the information into the last two columns of dataBrowser
output (from v1.2.0).
Figure 1. Similarity of de novo enriched motifs by MEME-ChIP and HOMER. The beeswarm and violin plots show the normalised Pearson correlation coefficient of de novo motifs called by MEME-ChIP and HOMER, and the red dash denotes normalised Pearson correlation coefficient value 0.7.
library(TFregulomeR)
# browse all records in TFregulomeR TFBS compendium
all_record <- dataBrowser() # or TFBSBrowser() before v1.2.0
#> 1468 record(s) found: ...
#> ... covering 415 TF(s)
#> ... from 1 species:
#> ... ...human
#> ... from 29 organ(s):
#> ... ... stem_cell, blood_and_lymph, connective_tissue, colorectum, brain, bone, stomach, prostate, breast, pancreas, skin, kidney, lung, eye, esophagus, heart, muscle, uterus, spleen, cervix, testis, liver, adrenal_gland, neck_and_mouth, pleura, ovary, thymus, fallopian, vagina
#> ... in 3 sample type(s):
#> ... ... primary_cells, cell_line, tissue
#> ... in 414 different cell(s) or tissue(s)
#> ... in 8 type(s) of disease state(s):
#> ... ... normal, tumor, Simpson_Golabi_Behmel_syndrome, progeria, metaplasia, unknown, immortalized, premetastatic
#> ... from the source(s): GTRD, MethMotif
# returned table
head(all_record)
#> ID species organ sample_type
#> 1 GTRD-EXP000061_HSA_embryonic-stem-cells_PRDM14 human stem_cell primary_cells
#> 2 GTRD-EXP000080_HSA_CD4pos-T-cells_YY1 human blood_and_lymph primary_cells
#> 3 GTRD-EXP000128_HSA_EWS502_FLI1 human connective_tissue cell_line
#> 4 GTRD-EXP000132_HSA_HUVEC_FLI1 human connective_tissue cell_line
#> 5 GTRD-EXP000140_HSA_LS180_TCF7L2 human colorectum cell_line
#> 6 GTRD-EXP000142_HSA_LS180_CEBPB human colorectum cell_line
#> cell_tissue_name description disease_state TF source source_ID
#> 1 embryonic-stem-cells embryonic stem cells normal PRDM14 GTRD EXP000061
#> 2 CD4pos-T-cells CD4+ T-cells normal YY1 GTRD EXP000080
#> 3 EWS502 Ewing sarcoma tumor FLI1 GTRD EXP000128
#> 4 HUVEC umbilical vein endothelial cells normal FLI1 GTRD EXP000132
#> 5 LS180 colon cancer tumor TCF7L2 GTRD EXP000140
#> 6 LS180 colon cancer tumor CEBPB GTRD EXP000142
#> peak_num peak_with_motif_num Consistent_with_HOCOMOCO_JASPAR Ncor_between_MEME_ChIP_and_HOMER
#> 1 17482 5656 YES NA
#> 2 18298 2038 YES NA
#> 3 89523 28914 YES NA
#> 4 71437 36126 YES NA
#> 5 16517 2312 YES NA
#> 6 128542 67747 YES NA
# browse TFBSs in blood and lymph
blood_and_lymph_record <- dataBrowser(organ = "blood_and_lymph") # or TFBSBrowser() before v1.2.0
#> 494 record(s) found: ...
#> ... covering 197 TF(s)
#> ... from 1 species:
#> ... ...human
#> ... from 1 organ(s):
#> ... ... blood_and_lymph
#> ... in 3 sample type(s):
#> ... ... primary_cells, cell_line, tissue
#> ... in 129 different cell(s) or tissue(s)
#> ... in 2 type(s) of disease state(s):
#> ... ... normal, tumor
#> ... from the source(s): GTRD, MethMotif
# browse all CEBPB TFBSs
CEBPB_record <- dataBrowser(tf = "CEBPB") # or TFBSBrowser() before v1.2.0
#> 16 record(s) founded: ...
#> ... covering 1 TF(s)
#> ... from 1 species:
#> ... ...human
#> ... from 9 organ(s):
#> ... ... colorectum, uterus, blood_and_lymph, stem_cell, bone, lung, cervix, liver,
#> breast
#> ... in 2 sample type(s):
#> ... ... cell_line, primary_cells
#> ... in 16 different cell(s) or tissue(s)
#> ... in 2 type(s) of disease state(s):
#> ... ... tumor, normal
#> ... from the source(s): GTRD, MethMotif
We have designed some useful functions for data retrieval from TFregulomeR compendium. Retrieval of motif matrix and DNA methylation matrix if the source is MethMotif (If the source is GTRD, no DNA methylation information is available) can be achieved using searchMotif
. Further, these obtained matrices can be easily saved locally using exportMMPFM
, and corresponding (Meth)Motif logo (MethMotif logo, if the source is MethMotif; motif logo, if the source is GTRD) can be simply plotted using plotLogo
. Here, we introduced an object of class “MethMotif” using S4 class in order for an easy and intuitive storage, manipulation and conversion (with other packages such as “TFBSTools”) of a MethMotif matrix, which contains a TF motif weight position matrix and its DNA methylation matrix (beta score matrix). What’s more, we allow users to directly load peak regions of a TF of interest (all peaks or peaks with motif only) from TFregulomeR compendium using loadPeaks
.
# according to TFBSBrowser results for all CEBPB TFBS query, we select two CEBPB TFBSs
# from MethMotif and GTRD: MM1_HSA_K562_CEBPB, GTRD-EXP040801_HSA_HL-60_CEBPB.
# loading MethMotif object in "MEME" format. Currently we support "MEME" and "TRANSFAC".
K562_CEBPB <- searchMotif(id = "MM1_HSA_K562_CEBPB", motif_format = "MEME")
#> There are a matched record exported in a MethMotif object.
HL60_CEBPB <- searchMotif(id = "GTRD-EXP040801_HSA_HL-60_CEBPB", motif_format = "MEME")
#> There are a matched record exported in a MethMotif object.
class(K562_CEBPB)
#> [1] "MethMotif"
#> attr(,"package")
#> [1] "TFregulomeR"
After obtaining a MethMotif matrix, user can use plotLogo
to plot logo as below (Figure 2). If the TFBS source is MethMotif, then a MethMotif logo will be saved. Two options are available for motif logo, “entropy” and “frequency”, and also different methylation levels (“all”, “methylated” and “unmethylated”) can be opted for methylation bar charts. However, if the TFBS source is GTRD, only a motif logo will be saved due to the unknown DNA methylation within motif.
In the plot, the number of peaks with motif will be printed. It should be NOTED that the motif logo is generated by the all possible TFBSs (initally scanned by FIMO after de novo motif discovery using MEME-ChIP suite with a p-value less than 1e-4) in the peak regions (+/-100bp around peak summits). It’s possbile that one peak region contains more than one significant TFBSs. Hence, the number of TFBSs in the peak regions could be larger than the number of peaks with motif.
For each MethMotif logo, bar plot above motif logo denotes the number of cytosines in CpG context covered by WGBS at each base position in motif and these CpGs are segregated into three groups shown in different colors, namely homogeneously methylated (orange bar, beta score > 90%), heterogeneously methylated (green bar, beta score 10-90%), and homogenously unmethylated (blue bar, beta score < 10%).
plotLogo(K562_CEBPB, logo_type = "entropy", meth_level = "all")
#> Success: a PDF named 'MM1_HSA_K562_CEBPB-logo-entropy.pdf' has been saved!
plotLogo(K562_CEBPB, logo_type = "entropy", meth_level = "methylated")
#> Success: a PDF named 'MM1_HSA_K562_CEBPB-logo-entropy-methylated-only.pdf' has been saved!
plotLogo(K562_CEBPB, logo_type = "entropy", meth_level = "unmethylated")
#> Success: a PDF named 'MM1_HSA_K562_CEBPB-logo-entropy-unmethylated-only.pdf' has been saved!
plotLogo(K562_CEBPB, logo_type = "frequency", meth_level = "all")
#> Success: a PDF named 'MM1_HSA_K562_CEBPB-logo-frequency.pdf' has been saved!
plotLogo(K562_CEBPB, logo_type = "frequency", meth_level = "methylated")
#> Success: a PDF named 'MM1_HSA_K562_CEBPB-logo-frequency-methylated-only.pdf' has been saved!
plotLogo(K562_CEBPB, logo_type = "frequency", meth_level = "unmethylated")
#> Success: a PDF named 'MM1_HSA_K562_CEBPB-logo-frequency-unmethylated-only.pdf' has been saved!
# plot motif logo for GTRD-EXP040801_HSA_HL-60_CEBPB. No DNA methylation states available here
plotLogo(HL60_CEBPB, logo_type = "entropy")
#> Success: a PDF named 'GTRD-EXP040801_HSA_HL-60_CEBPB-logo-entropy.pdf' has been saved!