The servers linked by TFregulomeR

TFregulomeR data compendium is hosted in Singapore and Canada. As an API, TFregulomeR (from v2.0.0) will dynamically access and retrieve the data either from Singapore (default) or Canada server. The functions in TFregulomeR that dynamically link to the servers include dataBrowser(), loadPeaks(), searchMotif(), commonPeaks(), exclusivePeaks(), intersectPeakMatrix(), motifDistrib(), genomeAnnotate() and toTFBSTools(). By default, these functions are linking to Singapore server. If users opt to choose the Canada server, they can use the input parameter server='ca' when using these functions.

Browse TFregulomeR data compendium

TFregulomeR allows users to easily browse TFregulomeR compendium using only one simple function dataBrowser (previously called TFBSBrowser before v1.2.0). Users can search the TFBS according to species, organ, sample type, cell/tissue name, TF name, disease state, and source. A data.frame will be returned upon searching, including the information in TFregulomeR ID (important for downstream analysis), species, organ, sample type, cell/tissue name, description of cell/tissue, disease state, TF name, source, TF source ID, number of peaks and number of peaks with motif. In particular, if no input is given for the function, all records in TFregulomeR compendium will be returned.

In TFregulomeR project, we used MACS2 (Zhang et al., 2008) to call peaks from ChIP-seq datasets and subsequently employed MEME-ChIP (Machanick et al., 2011) to perform de novo motif discovery in each peak set. MACS2 has been proved as one of the best peak callers in terms of sensitivity, precision and F-score metrics in a study encompassing 300 simulated ChIP-seq datasets with different noise levels (Thomas et al., 2017), while MEME-ChIP is one of the most commonly used motif callers and also utilised in both JASPAR 2010 and 2014 versions (Portales-Casamar et al., 2010; Mathelier et al., 2014). Highly and centrally enriched motifs were selected and compared with the existing TF-binding profile databases, such as HOCOMOCO (Kulakovskiy et al., 2018) and JASPAR. 91 highly enriched motifs were not consistent with the TF-binding profile databases. This is presumably due to the fact that in those cell types, TFs are indirectly recruited to genome, and/or that the high presence of some non-targeted motifs repeatedly observed across ChIP-seq datasets, also known as zinger motifs, mask the motif enrichment of the ChIP’ed TF (Hunt et al., 2014). Furthermore, 136 motifs were not recorded for their corresponding TFs in the databases. In order to verify that divergent motifs do not derive from the use of a specific motif discovery algorithm, we used another software, HOMER (Heinz et al., 2010), based on hypergeometric enrichment, to perform an additional de novo motif discovery. Motif results by HOMER were compared with those by MEME-ChIP and their similarity were measured by normalised Pearson correlation coefficient using compare-matrices function in RSAT (Nguyen et al., 2018) with the formula: Ncor = cor * w / w_smaller, where cor is raw Pearson correlation coefficient, w is the alignment width of two matrices from MEME-ChIP and HOMER (the minimum value of w was set as 5), and w_smaller is the width of smaller motifs from MEME-ChIP and HOMER. We found that majority of the PWM matrices generated with MEME-ChIP, was recapitulated with HOMER. We have added the information into the last two columns of dataBrowser output (from v1.2.0).

Figure 1. Similarity of de novo enriched motifs by MEME-ChIP and HOMER. The beeswarm and violin plots show the normalised Pearson correlation coefficient of de novo motifs called by MEME-ChIP and HOMER, and the red dash denotes normalised Pearson correlation coefficient value 0.7.

Figure 1. Similarity of de novo enriched motifs by MEME-ChIP and HOMER. The beeswarm and violin plots show the normalised Pearson correlation coefficient of de novo motifs called by MEME-ChIP and HOMER, and the red dash denotes normalised Pearson correlation coefficient value 0.7.

library(TFregulomeR)
# browse all records in TFregulomeR TFBS compendium
all_record <- dataBrowser() # or TFBSBrowser() before v1.2.0
#> 1468 record(s) found: ...
#> ... covering 415 TF(s)
#> ... from 1 species:
#> ... ...human
#> ... from 29 organ(s):
#> ... ... stem_cell, blood_and_lymph, connective_tissue, colorectum, brain, bone, stomach, prostate, breast, pancreas, skin, kidney, lung, eye, esophagus, heart, muscle, uterus, spleen, cervix, testis, liver, adrenal_gland, neck_and_mouth, pleura, ovary, thymus, fallopian, vagina
#> ... in 3 sample type(s):
#> ... ... primary_cells, cell_line, tissue
#> ... in  414  different cell(s) or tissue(s)
#> ... in 8 type(s) of disease state(s):
#> ... ... normal, tumor, Simpson_Golabi_Behmel_syndrome, progeria, metaplasia, unknown, immortalized, premetastatic
#> ... from the source(s): GTRD, MethMotif

# returned table
head(all_record)
#>                                               ID species             organ   sample_type
#> 1 GTRD-EXP000061_HSA_embryonic-stem-cells_PRDM14   human         stem_cell primary_cells
#> 2          GTRD-EXP000080_HSA_CD4pos-T-cells_YY1   human   blood_and_lymph primary_cells
#> 3                 GTRD-EXP000128_HSA_EWS502_FLI1   human connective_tissue     cell_line
#> 4                  GTRD-EXP000132_HSA_HUVEC_FLI1   human connective_tissue     cell_line
#> 5                GTRD-EXP000140_HSA_LS180_TCF7L2   human        colorectum     cell_line
#> 6                 GTRD-EXP000142_HSA_LS180_CEBPB   human        colorectum     cell_line
#>       cell_tissue_name                      description disease_state     TF source source_ID
#> 1 embryonic-stem-cells             embryonic stem cells        normal PRDM14   GTRD EXP000061
#> 2       CD4pos-T-cells                     CD4+ T-cells        normal    YY1   GTRD EXP000080
#> 3               EWS502                    Ewing sarcoma         tumor   FLI1   GTRD EXP000128
#> 4                HUVEC umbilical vein endothelial cells        normal   FLI1   GTRD EXP000132
#> 5                LS180                     colon cancer         tumor TCF7L2   GTRD EXP000140
#> 6                LS180                     colon cancer         tumor  CEBPB   GTRD EXP000142
#>   peak_num peak_with_motif_num Consistent_with_HOCOMOCO_JASPAR Ncor_between_MEME_ChIP_and_HOMER
#> 1    17482                5656                             YES                               NA
#> 2    18298                2038                             YES                               NA
#> 3    89523               28914                             YES                               NA
#> 4    71437               36126                             YES                               NA
#> 5    16517                2312                             YES                               NA
#> 6   128542               67747                             YES                               NA

# browse TFBSs in blood and lymph
blood_and_lymph_record <- dataBrowser(organ = "blood_and_lymph") # or TFBSBrowser() before v1.2.0
#> 494 record(s) found: ...
#> ... covering 197 TF(s)
#> ... from 1 species:
#> ... ...human
#> ... from 1 organ(s):
#> ... ... blood_and_lymph
#> ... in 3 sample type(s):
#> ... ... primary_cells, cell_line, tissue
#> ... in  129  different cell(s) or tissue(s)
#> ... in 2 type(s) of disease state(s):
#> ... ... normal, tumor
#> ... from the source(s): GTRD, MethMotif

# browse all CEBPB TFBSs
CEBPB_record <- dataBrowser(tf = "CEBPB") # or TFBSBrowser() before v1.2.0
#> 16 record(s) founded: ...
#> ... covering 1 TF(s)
#> ... from 1 species:
#> ... ...human
#> ... from 9 organ(s):
#> ... ... colorectum, uterus, blood_and_lymph, stem_cell, bone, lung, cervix, liver, 
#> breast
#> ... in 2 sample type(s):
#> ... ... cell_line, primary_cells
#> ... in  16  different cell(s) or tissue(s)
#> ... in 2 type(s) of disease state(s):
#> ... ... tumor, normal
#> ... from the source(s): GTRD, MethMotif

Retrieve datasets from TFregulomeR data warehouse

We have designed some useful functions for data retrieval from TFregulomeR compendium. Retrieval of motif matrix and DNA methylation matrix if the source is MethMotif (If the source is GTRD, no DNA methylation information is available) can be achieved using searchMotif. Further, these obtained matrices can be easily saved locally using exportMMPFM, and corresponding (Meth)Motif logo (MethMotif logo, if the source is MethMotif; motif logo, if the source is GTRD) can be simply plotted using plotLogo. Here, we introduced an object of class “MethMotif” using S4 class in order for an easy and intuitive storage, manipulation and conversion (with other packages such as “TFBSTools”) of a MethMotif matrix, which contains a TF motif weight position matrix and its DNA methylation matrix (beta score matrix). What’s more, we allow users to directly load peak regions of a TF of interest (all peaks or peaks with motif only) from TFregulomeR compendium using loadPeaks.

# according to TFBSBrowser results for all CEBPB TFBS query, we select two CEBPB TFBSs 
# from MethMotif and GTRD: MM1_HSA_K562_CEBPB, GTRD-EXP040801_HSA_HL-60_CEBPB.

# loading MethMotif object in "MEME" format. Currently we support "MEME" and "TRANSFAC".
K562_CEBPB <- searchMotif(id = "MM1_HSA_K562_CEBPB", motif_format = "MEME")
#> There are a matched record exported in a MethMotif object.
HL60_CEBPB <- searchMotif(id = "GTRD-EXP040801_HSA_HL-60_CEBPB", motif_format = "MEME")
#> There are a matched record exported in a MethMotif object.
class(K562_CEBPB)
#> [1] "MethMotif"
#> attr(,"package")
#> [1] "TFregulomeR"

After obtaining a MethMotif matrix, user can use plotLogo to plot logo as below (Figure 2). If the TFBS source is MethMotif, then a MethMotif logo will be saved. Two options are available for motif logo, “entropy” and “frequency”, and also different methylation levels (“all”, “methylated” and “unmethylated”) can be opted for methylation bar charts. However, if the TFBS source is GTRD, only a motif logo will be saved due to the unknown DNA methylation within motif.

In the plot, the number of peaks with motif will be printed. It should be NOTED that the motif logo is generated by the all possible TFBSs (initally scanned by FIMO after de novo motif discovery using MEME-ChIP suite with a p-value less than 1e-4) in the peak regions (+/-100bp around peak summits). It’s possbile that one peak region contains more than one significant TFBSs. Hence, the number of TFBSs in the peak regions could be larger than the number of peaks with motif.

For each MethMotif logo, bar plot above motif logo denotes the number of cytosines in CpG context covered by WGBS at each base position in motif and these CpGs are segregated into three groups shown in different colors, namely homogeneously methylated (orange bar, beta score > 90%), heterogeneously methylated (green bar, beta score 10-90%), and homogenously unmethylated (blue bar, beta score < 10%).

plotLogo(K562_CEBPB, logo_type = "entropy", meth_level = "all")
#> Success: a PDF named 'MM1_HSA_K562_CEBPB-logo-entropy.pdf' has been saved!
plotLogo(K562_CEBPB, logo_type = "entropy", meth_level = "methylated")
#> Success: a PDF named 'MM1_HSA_K562_CEBPB-logo-entropy-methylated-only.pdf' has been saved!
plotLogo(K562_CEBPB, logo_type = "entropy", meth_level = "unmethylated")
#> Success: a PDF named 'MM1_HSA_K562_CEBPB-logo-entropy-unmethylated-only.pdf' has been saved!
plotLogo(K562_CEBPB, logo_type = "frequency", meth_level = "all")
#> Success: a PDF named 'MM1_HSA_K562_CEBPB-logo-frequency.pdf' has been saved!
plotLogo(K562_CEBPB, logo_type = "frequency", meth_level = "methylated")
#> Success: a PDF named 'MM1_HSA_K562_CEBPB-logo-frequency-methylated-only.pdf' has been saved!
plotLogo(K562_CEBPB, logo_type = "frequency", meth_level = "unmethylated")
#> Success: a PDF named 'MM1_HSA_K562_CEBPB-logo-frequency-unmethylated-only.pdf' has been saved!

# plot motif logo for GTRD-EXP040801_HSA_HL-60_CEBPB. No DNA methylation states available here
plotLogo(HL60_CEBPB, logo_type = "entropy")
#> Success: a PDF named 'GTRD-EXP040801_HSA_HL-60_CEBPB-logo-entropy.pdf' has been saved!
Figure 2. CEBPB (Meth)Motif logos in K562 and HL-60

Figure 2. CEBPB (Meth)Motif logos in K562 and HL-60

Motif matrix as well as methylation matrix (beta score matrix), if available, can be saved locally using exportMMPFM. To be noted, the function exportMMPFM is also able to export (Meth)Motif matrix for the outputs of commonPeaks, exclusivePeaks and intersectPeakMatrix by specifying “fun =”. We will introduce it in the following section.

# export MethMotif matrix for MM1_HSA_K562_CEBPB
exportMMPFM(fun_output = K562_CEBPB, fun = "searchMotif", save_motif_PFM = TRUE, save_betaScore_matrix = TRUE)
#> Start exporting ... ...
#> ... ... You chose to save motif PFM and beta score matrix.
#> ... ... export searchMotif
#> ... ... ... ... Beta score matrix has been saved as 'MM1_HSA_K562_CEBPB-methScore.txt'.
#> ... ... ... ... Motif PFM has been saved as 'MM1_HSA_K562_CEBPB-motif-MEME.txt'.

# export motif matrix for GTRD-EXP040801_HSA_HL-60_CEBPB
exportMMPFM(fun_output = HL60_CEBPB, fun = "searchMotif", save_motif_PFM = TRUE, save_betaScore_matrix = TRUE)
#> Start exporting ... ...
#> ... ... You chose to save motif PFM and beta score matrix.
#> ... ... export searchMotif
#> ... ... ... ... No beta score matrix is available. Skip!
#> ... ... ... ... Motif PFM has been saved as 'GTRD-EXP040801_HSA_HL-60_CEBPB-motif-MEME.txt'.

# exprot motif matrix in TRANSFAC format
K562_CEBPB_TRANSFAC <- searchMotif(id = "MM1_HSA_K562_CEBPB", motif_format = "TRANSFAC")
#> There are a matched record exported in a MethMotif object.
exportMMPFM(fun_output <- K562_CEBPB_TRANSFAC, fun = "searchMotif", save_motif_PFM = TRUE, save_betaScore_matrix = TRUE)
#> Start exporting ... ...
#> ... ... You chose to save motif PFM and beta score matrix.
#> ... ... export searchMotif
#> ... ... ... ... Beta score matrix has been saved as 'MM1_HSA_K562_CEBPB-methScore.txt'.
#> ... ... ... ... Motif PFM has been saved as 'MM1_HSA_K562_CEBPB-motif-TRANSFAC.txt'.

More importantly, we allow the users to load peak regions (all peaks or peaks only with motif) of all TFs in TFregulomeR compendium using loadPeaks. To be noted, the peak regions of a given TF in TFregulomeR compendium are the peak summits (hg38 for human), and the TFBS is enriched in a +/- 100bp window surrounding the peak summits. For each peak region, we also provide its tag (read) fold change (fifth column of returned peaks). This read enrichment value is obtained from MACS2 and denotes the fold change of reads in TF ChIP-seq compared to input ChIP-seq.

K562_CEBPB_peaks <- loadPeaks(id = "MM1_HSA_K562_CEBPB", includeMotifOnly = TRUE)
#> Success: peak file has been returned in a data frame!
head(K562_CEBPB_peaks)
#>    chr     start       end                                    id tag_fold_change
#> 1 chr3 101823721 101823722 MM1_HSA_K562_CEBPB_peaks_with_motif_1        22.91742
#> 2 chr3 101850619 101850620 MM1_HSA_K562_CEBPB_peaks_with_motif_2        13.09647
#> 3 chr3 102182290 102182291 MM1_HSA_K562_CEBPB_peaks_with_motif_3        17.28870
#> 4 chr3 105626970 105626971 MM1_HSA_K562_CEBPB_peaks_with_motif_4        23.36092
#> 5 chr3 105647238 105647239 MM1_HSA_K562_CEBPB_peaks_with_motif_5        34.80412
#> 6 chr3 105899733 105899734 MM1_HSA_K562_CEBPB_peaks_with_motif_6        13.61153

Study TFBS propensity

Common peak regions

Figure 3. functionality in TFregulomeR for common peak analysis

Figure 3. functionality in TFregulomeR for common peak analysis

TFregulomeR provides the functionality to find the common peak regions along with DNA methylation profiles and read (tag) enrichments using commonPeaks.

For the target peak sets, users can directly use TFregulomeR TF peaks (hg38 for human) by inputting its TFregulomeR IDs in target_peak_id (all peaks or peaks with motif only can be opted using motif_only_for_target_peak), or their own peak regions in user_target_peak_list. If customised peak sets are provided, all peak sets should be stored in an R list(), and each peak set should be a bed-format data.frame with the first three columns as chromosome (starting with ‘chr’), start and end. It’s recommended that users provide the UNIQUE IDs for their customised peak list in user_target_peak_id (also should be unique to the provided TFregulomeR ID list in target_peak_id). If unavailable, the function will automatically assign IDs for the user’s peak sets. It should be noted that if the customised peak set is derived TFregulomeR compendium, it’s highly recommended that its TFregulomeR ID should be provided correspondingly in user_target_peak_id if one opts to profile the DNA methylation levels. Even though TFregulomeR peak sets are peak summits, the function is able to recognise it with the provided TFregulomeR ID in user_target_peak_id and automatically expand +/- 100bp during the analysis.

For the compared peak sets, same rules are applicable to options compared_peak_id, motif_only_for_compared_peak, user_compared_peak_list and user_compared_peak_id when loading compared peak sets.

During the analysis, EACH of target peak set from TFregulomeR compendium using target_peak_id and/or user provided using user_target_peak_list will be compared with ALL input compared peak sets (compared_peak_id and user_compared_peak_list), to get a final target sub-ensemble peaks shared by all compared peak sets. If methylation_profile_in_narrow_region=TRUE, DNA methylation profiling in +/- 100bp surrounding peak summit will be performed for each target common peak sub-ensemble, if its ID labeled in target_peak_id and user_target_peak_list matches a MethMotif ID (“MM1_HSA_”) in TFregulomeR.

# read my local file. Here we use the HCT116 CEBPB binding sites from the publication PMID: 30380113
my_peak_path <- system.file("extdata", "HCT116_CEBPb_binding_sites.txt", package = "TFregulomeR")
my_peak <- read.delim(my_peak_path, sep = "\t", header = FALSE)
head(my_peak)
#>      V1        V2        V3
#> 1  chr1  58585814  58585827
#> 2 chr12 122925699 122925712
#> 3  chr9   5818111   5818124
#> 4 chr19   5850889   5850902
#> 5 chr10   5951274   5951287
#> 6  chr1   8175025   8175038

# To get the sub-ensemble peaks of K562 CEBPB peaks and my peaks, which are
# share with the CEBPB peaks in all cell types from TFregulomeR compendium,
# and at the same time to profile the DNA methylation states in the final subsets.

# 1) Get all CEBPB records in TFregulomeR compendium
CEBPB_record <- dataBrowser(tf = "CEBPB") # or TFBSBrowser() before v1.2.0
#> 16 records(s) founded: ...
#> ... covering 1 TF(s)
#> ... from 1 species:
#> ... ...human
#> ... from 9 organ(s):
#> ... ... colorectum, uterus, blood_and_lymph, stem_cell, bone, lung, cervix, liver, breast
#> ... in 2 sample type(s):
#> ... ... cell_line, primary_cells
#> ... in  16  different cell(s) or tissue(s)
#> ... in 2 type(s) of disease state(s):
#> ... ... tumor, normal
#> ... from the source(s): GTRD, MethMotif

# 2) Start commonPeaks analysis
commonPeak_output <- commonPeaks(target_peak_id = "MM1_HSA_K562_CEBPB",
                                 motif_only_for_target_peak = TRUE, 
                                 user_target_peak_list = list(my_peak), 
                                 user_target_peak_id = c("HCT116_CEBPB"), 
                                 compared_peak_id = CEBPB_record$ID, 
                                 motif_only_for_compared_peak = TRUE, 
                                 methylation_profile_in_narrow_region = TRUE)
#> TFregulomeR::commonPeaks() starting ... ...
#> You chose to profile the methylation levels in 200bp window around peak summits, 
#> if there is any peak loaded from TFregulome
#> Loading target peak list ... ...
#> ... You have 1 TFBS(s) requested to be loaded from TFregulomeR server
#> ... You chose to load TF peaks with motif only. Using 'motif_only_for_target_peak' tunes your options
#> ... loading TFBS(s) from TFregulomeR now
#> ... ... peak file loaded successfully for id 'MM1_HSA_K562_CEBPB'
#> ... Done loading TFBS(s) from TFregulome
#> ... You have 1 customised peak set(s)
#> Loading compared peak list ... ...
#> ... You have 16 TFBS(s) requested to be loaded from TFregulomeR server
#> ... You chose to load TF peaks with motif only. Using 'motif_only_for_compared_peak' tunes your options
#> ... loading TFBS(s) from TFregulomeR now
#> ... ... peak file loaded successfully for id 'GTRD-EXP000142_HSA_LS180_CEBPB'
#> ... ... peak file loaded successfully for id 'GTRD-EXP010975_HSA_Ishikawa_CEBPB'
#> ... ... peak file loaded successfully for id 'GTRD-EXP030173_HSA_LoVo_CEBPB'
#> ... ... peak file loaded successfully for id 'GTRD-EXP030702_HSA_blood-monocytes_CEBPB'
#> ... ... peak file loaded successfully for id 'GTRD-EXP034967_HSA_mesenchymal-stem-cells_CEBPB'
#> ... ... peak file loaded successfully for id 'GTRD-EXP036478_HSA_fetal-osteoblasts_CEBPB'
#> ... ... peak file loaded successfully for id 'GTRD-EXP040652_HSA_monocyte-derived-macrophages_CEBPB'
#> ... ... peak file loaded successfully for id 'GTRD-EXP040801_HSA_HL-60_CEBPB'
#> ... ... peak file loaded successfully for id 'MM1_HSA_A549_CEBPB'
#> ... ... peak file loaded successfully for id 'MM1_HSA_H1-hESC_CEBPB'
#> ... ... peak file loaded successfully for id 'MM1_HSA_HCT116_CEBPB'
#> ... ... peak file loaded successfully for id 'MM1_HSA_HeLa-S3_CEBPB'
#> ... ... peak file loaded successfully for id 'MM1_HSA_HepG2_CEBPB'
#> ... ... peak file loaded successfully for id 'MM1_HSA_IMR-90_CEBPB'
#> ... ... peak file loaded successfully for id 'MM1_HSA_K562_CEBPB'
#> ... ... peak file loaded successfully for id 'MM1_HSA_MCF-7_CEBPB'
#> ... Done loading TFBS(s) from TFregulome
#> Start analysing: MM1_HSA_K562_CEBPB... ...
#> Start analysing: HCT116_CEBPB... ...
#> Done analysing.

After getting the output of commonPeaks, you can use commonPeakResult to get 1) the summary, 2) the common peak regions, 3) DNA methylation levels in 200bp around common peak summits if the input TF source is MethMotif in TFregulomeR compendium and 4) the (Meth)Motif logo if TF input is from TFregulomeR warehouse or with TFregulomeR ID.

commonPeak_result <- commonPeakResult(commonPeaks = commonPeak_output,
                                      return_common_peak_sites = TRUE, 
                                      save_MethMotif_logo = TRUE, 
                                      return_methylation_profile = TRUE, 
                                      return_summary = TRUE)
#> Start getting the results of commonPeakResult ...
#> ... ... You chose to return common peak sites;
#> ... ... You chose to return methylation profile;
#> ... ... You chose to return common peak summary;
#> ... ... ... ALL of common peak sets, methylation profiles and peak summary will be 
#> stored in a list, and named with 'common_peak_list', 'methylation_profile' and 
#> 'peak_summary' in the list. Use 'names()' in the output for its detials.
#> ... ... You chose to save MethMotif logo in PDF if any;
#> ... ... ... You chose entropy logo;
#> ... ... ... You chose to show all methylation levels;
#> Success: a PDF named 'MM1_HSA_K562_CEBPB_common_peaks-logo-entropy.pdf' has been saved!
#> ... ... ... The input peak set for the results 'HCT116_CEBPB_common_peaks' was 
#> not originated from TFregulomeR or the number of direct binding sites in the 
#> common peaks is 0, so no motif logo available.

# the contents in commonPeak_result
names(commonPeak_result)
#> [1] "common_peak_list"    "methylation_profile" "peak_summary" 
common_peak_list <- commonPeak_result$common_peak_list
methylation_profile <- commonPeak_result$methylation_profile
peak_summary <- commonPeak_result$peak_summary

# peak summary: 1.137225% of K562 CEBPB peaks and 2.517540% of my peaks are common with 
# the CEBPB peaks in all cell types available in TFregulomeR warehouse.
peak_summary
#>                                 percentage_in_original_inputs(%)
#> MM1_HSA_K562_CEBPB_common_peaks                         1.137225
#> HCT116_CEBPB_common_peaks                               2.517540

# common peak regions
K562_CEBPB_common_peak <- common_peak_list$MM1_HSA_K562_CEBPB_common_peaks
head(K562_CEBPB_common_peak)
#>      chr     start       end                                      id tag_fold_change
#> 39  chr3 126261153 126261154  MM1_HSA_K562_CEBPB_peaks_with_motif_39        91.77499
#> 145 chr3 152100699 152100700 MM1_HSA_K562_CEBPB_peaks_with_motif_145        74.01529
#> 276 chr3 188239782 188239783 MM1_HSA_K562_CEBPB_peaks_with_motif_276        13.06116
#> 283 chr3 194003766 194003767 MM1_HSA_K562_CEBPB_peaks_with_motif_283        26.67740
#> 435 chr4  73704747  73704748 MM1_HSA_K562_CEBPB_peaks_with_motif_435        54.26637
#> 456 chr4  77158137  77158138 MM1_HSA_K562_CEBPB_peaks_with_motif_456        31.97526

# methylation profile in common peak regions
names(methylation_profile)
#> [1] "MM1_HSA_K562_CEBPB_common_peaks" "HCT116_CEBPB_common_peaks"
methylation_profile$MM1_HSA_K562_CEBPB_common_peaks
#>         CpG_num
#> 0-10%       161  #161 CpG methylation scores are less than 0.1 (homogeneously unmethylated) in +/-100bp window around common peak summits
#> 10-20%        9
#> 20-30%       17
#> 30-40%        2
#> 40-50%        1
#> 50-60%        2
#> 60-70%        9
#> 70-80%        5
#> 80-90%       13
#> 90-100%      25 #25 CpG methylation scores are more than 0.9 (homogeneously methylated) in +/-100bp window around common peak summits

# customised input peaks are not originated from TFregulomeR compendium, so no DNA methylation states
methylation_profile$HCT116_CEBPB_common_peaks
#>      [,1]
#> [1,]   NA
Figure 4. MethMotif logo of K562 CEBPB common peaks

Figure 4. MethMotif logo of K562 CEBPB common peaks

Exclusive peak regions

Figure 5. functionality in TFregulomeR for exclusive peak analysis

Figure 5. functionality in TFregulomeR for exclusive peak analysis

Exclusive peak regions are important to study a TF’s context-specific function. Hence, we implemented such functionality to achieve the extraction of the context dependent peak loci along with DNA methylation profiles, exclusivePeaks.

For the target peak sets, users can directly use TFregulomeR TF peaks (hg38 for human) by inputting the TFregulomeR IDs in target_peak_id (all peaks or peaks with motif only can be selected using motif_only_for_target_peak), or their own peak regions in user_target_peak_list. If customised peak sets are provided, all peak sets should be stored in an R list(), and each peak set should be a bed-format data.frame with the first three columns as chromosome (starting with ‘chr’), start and end. It’s recommended that users provide the UNIQUE IDs for their customised peak list in user_target_peak_id (also should be unique to the provided TFregulomeR ID list in target_peak_id). If unavailable, the function will automatically assign IDs for the provided peak sets. It should be noted that if the customised target peak set is derived from TFregulomeR compendium, it’s highly recommended that its TFregulomeR ID should be provided correspondingly in user_target_peak_id if one opts to profile the DNA methylation levels . Even though TFregulomeR peak sets are peak summits, the function is able to recognise it with the provided TFregulomeR ID in user_target_peak_id and automatically expand +/- 100bp during the analysis.

For the excluded peak sets, same rules are applicable to options excluded_peak_id, motif_only_for_excluded_peak, user_excluded_peak_list and user_excluded_peak_id when loading excluded peak sets.

During the analysis, EACH of target peak set from TFregulomeR compendium using target_peak_id and/or user provided using user_target_peak_list will be compared with ALL input excluded peak sets from excluded_peak_id and user_excluded_peak_list, to get a final target peak sub-ensemble which is exclusive from all input excluded peak sets. If methylation_profile_in_narrow_region=TRUE, DNA methylation profiling in +/- 100bp surrounding peak summit will be performed for each target exclusive peak sub-ensemble, if its ID labeled in target_peak_id and user_target_peak_list matches a MethMotif ID record (“MM1_HSA_”) in TFregulomeR compendium.

# To get the exclusive subset of K562 CEBPB peaks and at the same time to profile the DNA 
# methylation states in the exclusive subset

# 1) Get all CEBPB records in TFregulomeR warehouse
CEBPB_record <- dataBrowser(tf = "CEBPB") # or TFBSBrowser() before v1.2.0
#> 16 record(s) founded: ...
#> ... covering 1 TF(s)
#> ... from 1 species:
#> ... ...human
#> ... from 9 organ(s):
#> ... ... colorectum, uterus, blood_and_lymph, stem_cell, bone, lung, cervix, liver, breast
#> ... in 2 sample type(s):
#> ... ... cell_line, primary_cells
#> ... in  16  different cell(s) or tissue(s)
#> ... in 2 type(s) of disease state(s):
#> ... ... tumor, normal
#> ... from the source(s): GTRD, MethMotif

# 2) All CEBPB TFregulomeR IDs except MM1_HSA_K562_CEBPB
CEBPB_record_ID_noK562 <- CEBPB_record$ID[!(CEBPB_record$ID %in% "MM1_HSA_K562_CEBPB")]
exclusivePeak_output <- exclusivePeaks(target_peak_id = "MM1_HSA_K562_CEBPB", 
                                       motif_only_for_target_peak = TRUE, 
                                       excluded_peak_id = CEBPB_record_ID_noK562, 
                                       motif_only_for_excluded_peak = TRUE, 
                                       methylation_profile_in_narrow_region = TRUE)
#> TFregulomeR::exclusivePeaks() starting ... ...
#> You chose to profile the methylation levels in 200bp window around peak summits, 
#> if there is any peak loaded from TFregulome
#> Loading target peak list ... ...
#> ... You have 1 TFBS(s) requested to be loaded from TFregulomeR server
#> ... You chose to load TF peaks with motif only. Using 'motif_only_for_target_peak' tunes your options
#> ... loading TFBS(s) from TFregulomeR now
#> ... ... peak file loaded successfully for id 'MM1_HSA_K562_CEBPB'
#> ... Done loading TFBS(s) from TFregulome
#> Loading excluded peak list ... ...
#> ... You have 15 TFBS(s) requested to be loaded from TFregulomeR server
#> ... You chose to load TF peaks with motif only. Using 'motif_only_for_excluded_peak' tunes your options
#> ... loading TFBS(s) from TFregulomeR now
#> ... ... peak file loaded successfully for id 'GTRD-EXP000142_HSA_LS180_CEBPB'
#> ... ... peak file loaded successfully for id 'GTRD-EXP010975_HSA_Ishikawa_CEBPB'
#> ... ... peak file loaded successfully for id 'GTRD-EXP030173_HSA_LoVo_CEBPB'
#> ... ... peak file loaded successfully for id 'GTRD-EXP030702_HSA_blood-monocytes_CEBPB'
#> ... ... peak file loaded successfully for id 'GTRD-EXP034967_HSA_mesenchymal-stem-cells_CEBPB'
#> ... ... peak file loaded successfully for id 'GTRD-EXP036478_HSA_fetal-osteoblasts_CEBPB'
#> ... ... peak file loaded successfully for id 'GTRD-EXP040652_HSA_monocyte-derived-macrophages_CEBPB'
#> ... ... peak file loaded successfully for id 'GTRD-EXP040801_HSA_HL-60_CEBPB'
#> ... ... peak file loaded successfully for id 'MM1_HSA_A549_CEBPB'
#> ... ... peak file loaded successfully for id 'MM1_HSA_H1-hESC_CEBPB'
#> ... ... peak file loaded successfully for id 'MM1_HSA_HCT116_CEBPB'
#> ... ... peak file loaded successfully for id 'MM1_HSA_HeLa-S3_CEBPB'
#> ... ... peak file loaded successfully for id 'MM1_HSA_HepG2_CEBPB'
#> ... ... peak file loaded successfully for id 'MM1_HSA_IMR-90_CEBPB'
#> ... ... peak file loaded successfully for id 'MM1_HSA_MCF-7_CEBPB'
#> ... Done loading TFBS(s) from TFregulome
#> Start analysing: MM1_HSA_K562_CEBPB... ...
#> Done analysing.

After getting the output of exclusivePeaks, you can use exclusivePeakResult to get 1) the summary, 2) the exclusive peak regions, 3) DNA methylation levels in exclusive peak subsets if the input source is MethMotif in TFregulomeR compendium and 4) the (Meth)Motif logo if the input target is from TFregulomeR compendium or comes with a TFregulomeR ID.

exclusivePeak_result <- exclusivePeakResult(exclusivePeaks = exclusivePeak_output,
                                            return_exclusive_peak_sites = TRUE,
                                            save_MethMotif_logo = TRUE, 
                                            return_methylation_profile = TRUE,
                                            return_summary = TRUE)
#> Start getting the results of exclusivePeaks ...
#> ... ... You chose to return exclusive peak sites;
#> ... ... You chose to return methylation profile;
#> ... ... You chose to return exclusive peak summary;
#> ... ... ... ALL of exclusive peak sets, methylation profiles and peak summary will 
#> be stored in a list, and named with 'exclusive_peak_list', 'methylation_profile' and 
#> 'peak_summary' in the list. Use 'names()' in the output for its detials.
#> ... ... You chose to save MethMotif logo in PDF if any;
#> ... ... ... You chose entropy logo;
#> ... ... ... You chose to show all methylation levels;
#> Success: a PDF named 'MM1_HSA_K562_CEBPB_exclusive_peaks-logo-entropy.pdf' has been saved!

# the contents in exclusivePeak_result
names(exclusivePeak_result)
#> [1] "exclusive_peak_list" "methylation_profile" "peak_summary" 
exclusive_peak_list <- exclusivePeak_result$exclusive_peak_list
peak_summary <- exclusivePeak_result$peak_summary
methylation_profile <- exclusivePeak_result$methylation_profile

# peak summary, 9.110437% of K562 CEBPB peaks are unique compared with 
# all CEBPB TFBSs in TFregulomeR warehouse
peak_summary
#>                                    percentage_in_original_inputs(%)
#> MM1_HSA_K562_CEBPB_exclusive_peaks                         9.110437

K562_CEBPB_exclusive_peak <- exclusive_peak_list$MM1_HSA_K562_CEBPB_exclusive_peaks
head(K562_CEBPB_exclusive_peak)
#>     chr     start       end                                     id tag_fold_change
#> 4  chr3 105626970 105626971  MM1_HSA_K562_CEBPB_peaks_with_motif_4        23.36092
#> 22 chr3 119695686 119695687 MM1_HSA_K562_CEBPB_peaks_with_motif_22        10.89517
#> 45 chr3 128415444 128415445 MM1_HSA_K562_CEBPB_peaks_with_motif_45        40.57037
#> 57 chr3 130184388 130184389 MM1_HSA_K562_CEBPB_peaks_with_motif_57        23.26712
#> 58 chr3 130343212 130343213 MM1_HSA_K562_CEBPB_peaks_with_motif_58        13.14518
#> 68 chr3 133629332 133629333 MM1_HSA_K562_CEBPB_peaks_with_motif_68        20.32872

# methylation profile in exclusive peak regions
names(methylation_profile)
#> [1] "MM1_HSA_K562_CEBPB_exclusive_peaks"
methylation_profile$MM1_HSA_K562_CEBPB_exclusive_peaks
#>         CpG_num
#> 0-10%       852  #852 CpG methylation scores are less than 0.1 (homogeneously unmethylated) in +/-100bp window around exclusive peak summits
#> 10-20%      149
#> 20-30%       86
#> 30-40%       63
#> 40-50%       45
#> 50-60%       39
#> 60-70%       25
#> 70-80%       46
#> 80-90%       29
#> 90-100%      40 #40 CpG methylation scores are more than 0.9 ((homogeneously methylated)) in +/-100bp window around exclusive peak summits
Figure 6. MethMotif logo of K562 CEBPB exclusive peaks

Figure 6. MethMotif logo of K562 CEBPB exclusive peaks

Intersected peak matrix

Figure 7. functionality in TFregulomeR for cofactor and interactome analysis

Figure 7. functionality in TFregulomeR for cofactor and interactome analysis

TFregulomeR allows users to portray the co-binding landscapes between two collections of TFs, along with DNA methylation states in the pair-wise intersected peaks, using intersectPeakMatrix. This functionality is particularly useful to study TF cofactor and interactome (see the detail regarding TF interactome at here) in a cell type. Different from commonPeaks, intersectPeakMatrix perform an exhaustive pair-wise intersection analysis between peak list x and peak list y, to form an x*y intersection matrix. Therefore, it is required for users to provide the two lists of peak sets.

For peak list x, users can directly use TFregulomeR peaks by providing TFregulomeR ID in peak_id_x and indicating whether loading peaks with motif only using motif_only_for_id_x. In addition, customised peak sets can also be input in user_peak_list_x. It’s recommended that UNIQUE IDs (also unique to IDs in peak_id_x) be provided for each customised peak set in user_peak_x_id. If the customised peak set is derived from TFregulomeR compendium, it’s highly recommended that the corresponding TFregulomeR ID be provided in user_peak_x_id, which allows the function to recognise the source of peak set and to properly profile the DNA methylation states (if methylation_profile_in_narrow_region=TRUE) as well as read enrichments in the intersected regions. Even though TFregulomeR peak sets are peak summits, the function is able to recognise it with the provided TFregulomeR ID in peak_id_x and automatically expand +/- 100bp during the analysis.

Same principles are applicable for peak list y.

In addidtion to profile DNA methylation landscapes across the peak regions of different TF-pair combinations (TFx-TFy), from TFregulomeR v1.2.2 onwards, users are allowed to input their own external signals in bed format (at least four column: chromosome, start, end, and score) and assess external signal scores across the peak regions of different TF-pair combinations (TFx-TFy) at the same time. For example, if DNase-seq bedgraph is provided (the fourth column denotes DNase-seq read intensity), DNase-seq read intensity values will be profiled during TF intersection analysis. Just simply inputting this data via external_source in intersectPeakMatrix, users could achieve this analysis. Of note, it is highly recommended that users could refine the external signal file into the certain potential chromosome regions of their interest to speed up the analysis, because the file at whole-genome level could substantially slow down analysis (see the example at TF interactome demo).

Advice: It’s to be noted that function intersectPeakMatrix could take up several minutes to hours depending on the number of input TFs. Save the intersectPeakMatrix output into R data format using saveRDS(object, file="my_data.rds") and restore it using readRDS(file="my_data.rds") if you need to re-use the output to extract other results with function intersectPeakMatrixResult.

# profile the co-binding landscapes of all K562 TFs in TFregulomeR warehouse surrounding
# K562 CEBPB common and exclusive peaks

# browse all TFBS record in K562
K562_TFBS = dataBrowser(cell_tissue_name = "K562") # or TFBSBrowser() before v1.2.0
#> 131 record(s) found: ...
#> ... covering 131 TF(s)
#> ... from 1 species:
#> ... ...human
#> ... from 1 organ(s):
#> ... ... blood_and_lymph
#> ... in 1 sample type(s):
#> ... ... cell_line
#> ... in  1  different cell(s) or tissue(s)
#> ... in 1 type(s) of disease state(s):
#> ... ... tumor
#> ... from the source(s): MethMotif

# co-binding landscapes in K562 CEBPB common peaks
intersectMatrix_common <- intersectPeakMatrix(user_peak_list_x = list(K562_CEBPB_common_peak),
                                              user_peak_x_id =  c("MM1_HSA_K562_CEBPB"), 
                                              peak_id_y = K562_TFBS$ID, 
                                              motif_only_for_id_y = TRUE, 
                                              methylation_profile_in_narrow_region = TRUE)
#> TFregulomeR::intersectPeakMatrix() starting ... ...
#> You chose to profile the methylation levels in 200bp window around peak summits, 
#> if there is any peak loaded from TFregulome. It will make the program slow. 
#> Disable it if you want a speedy analysis and do not care about methylation
#> Loading peak list x ... ...
#> ... You have 1 customised peak set(s)
#> Loading peak list y ... ...
#> ... You have 131 TFBS(s) requested to be loaded from TFregulomeR server
#> ... You chose to load TF peaks with motif only. Using 'motif_only_for_id_y' tunes your options
#> ... loading TFBS(s) from TFregulomeR now
#> ... ... peak file loaded successfully for id 'MM1_HSA_K562_AFF1'
#> ... ... peak file loaded successfully for id 'MM1_HSA_K562_ARID2'
#> ... ... peak file loaded successfully for id 'MM1_HSA_K562_ARID3A'
#> ... ... peak file loaded successfully for id 'MM1_HSA_K562_ATF1'
#>     ... ...
#> ... ... peak file loaded successfully for id 'MM1_HSA_K562_ZSCAN29'
#> ... Done loading TFBS(s) from TFregulome
#> Start analysing list x:MM1_HSA_K562_CEBPB... ...
#> ... ... Start analysing list y:MM1_HSA_K562_AFF1
#> ... ... Start analysing list y:MM1_HSA_K562_ARID2
#> ... ... Start analysing list y:MM1_HSA_K562_ARID3A
#> ... ... Start analysing list y:MM1_HSA_K562_ATF1
#>     ... ...
#> ... ... Start analysing list y:MM1_HSA_K562_ZSCAN29

# K562 CEBPB exclusive peaks
intersectMatrix_exclusive <- intersectPeakMatrix(user_peak_list_x = list(K562_CEBPB_exclusive_peak), 
                                                 user_peak_x_id = c("MM1_HSA_K562_CEBPB"), 
                                                 peak_id_y = K562_TFBS$ID,
                                                 motif_only_for_id_y = TRUE,
                                                 methylation_profile_in_narrow_region = TRUE)
#> TFregulomeR::intersectPeakMatrix() starting ... ...
#> You chose to profile the methylation levels in 200bp window around peak summits, 
#> if there is any peak loaded from TFregulome. It will make the program slow. 
#> Disable it if you want a speedy analysis and do not care about methylation
#> Loading peak list x ... ...
#> ... You have 1 customised peak set(s)
#> Loading peak list y ... ...
#> ... You have 108 TFBS(s) requested to be loaded from TFregulomeR server
#> ... You chose to load TF peaks with motif only. Using 'motif_only_for_id_y' tunes your options
#> ... loading TFBS(s) from TFregulomeR now
#> .. ... peak file loaded successfully for id 'MM1_HSA_K562_AFF1'
#> .. ... peak file loaded successfully for id 'MM1_HSA_K562_ARID2'
#> .. ... peak file loaded successfully for id 'MM1_HSA_K562_ARID3A'
#> .. ... peak file loaded successfully for id 'MM1_HSA_K562_ATF1'
#>     ... ...
#> .. ... peak file loaded successfully for id 'MM1_HSA_K562_ZSCAN29'
#> ... Done loading TFBS(s) from TFregulome
#> Start analysing list x:MM1_HSA_K562_CEBPB... ...
#> ... ... Start analysing list y:MM1_HSA_K562_AFF1
#> ... ... Start analysing list y:MM1_HSA_K562_ARID2
#> ... ... Start analysing list y:MM1_HSA_K562_ARID3A
#> ... ... Start analysing list y:MM1_HSA_K562_ATF1
#>     ... ...
#> ... ... Start analysing list y:MM1_HSA_K562_ZSCAN29

We have implemented intersectPeakMatrixResult in TFregulomeR package, allowing an easy extraction and interpretation of intersectPeakMatrix output. It is worth noting that there are two ways of interpretations in intersection between set A and B, that is, 1) the percentage of A overlapped with B, and 2) the percentage of B intersected with A. Same principle is applicable for the output of intersectPeakMatrix.

The output of intersectPeakMatrix is a X*Y matrix table (X peak sets in peak list x and Y peak sets in peak list y) and each table cell contains an intersectPeakMatrix class object. IntersectPeakMatrix class is an exclusively-designed S4 class and each object contains:

  1. Information of peak x subset overlapped with peak y:
  1. overlap percentage in peak x;

  2. motif in peak x overlapped with peak y;

  3. tag enrichment in peak x overlapped with peak y;

  4. DNA methylation in peak x overlapped with peak y, if peak x is derived from MethMotif in TFregulomeR compendium.

  5. external signal score in peak x overlapped with peak y, if external signal file is provided.

and

  1. Information of peak y subset overlapped with peak x:
  1. overlap percentage in peak y;

  2. motif in peak y overlapped with peak x;

  3. tag enrichment in peak y overlapped with peak x;

  4. DNA methylation in peak y overlapped with peak x, if peak y is derived from MethMotif in TFregulomeR compendium.

  5. external signal score in peak y overlapped with peak x, if external signal file is provided.

By using return_intersection_matrix = TRUE and angle_of_matrix = "x" in function
intersectPeakMatrixResult, user can obtain an intersection matrix table. Row i and column j table cell denotes the percent of peak x(i) overlapped with peak y(j). If angle_of_matrix = "y", then row i and column j table cell denotes the percent of peak y(j) overlapped with peak x(i).

By using return_methylation_profile = TRUE and angle_of_methylation_profile = "x" in function
intersectPeakMatrixResult, user can obtain a DNA methylation matrix table. Row i and column j table cell contains a vector about the statistics of CpG methylation states within the peak x(i) overlapped with peak y(j). If angle_of_matrix = "y", then row i and column j table cell contains a vector about the statistics of CpG methylation states within the peak y(j) overlapped with peak x(i).

By using save_MethMotif_logo = TRUE and angle_of_logo = "x" in function
intersectPeakMatrixResult, the function will plot and save (Meth)Motif logo for each peak x intersected with peak y (if any of peak x is derived from TFregulomeR compendium). If angle_of_logo = "y", the function will plot and save (Meth)Motif logo for each peak y intersected with peak x (if any of peak y is derived from TFregulomeR compendium). Indeed, it’s not necessary to plot all (Meth)Motif logos for every pair of intersection at the same time. One can focus only on some subsets of peak_list_x and peak_list_y, using saving_MethMotif_logo_x_id and saving_MethMotif_logo_y_id respectively.

By using return_tag_density = TRUE and angle_of_tag_density = "x" in function
intersectPeakMatrixResult, user can obtain a read enrichment matrix table. Row i and column j table cell denotes a read enrichment value within the peak x(i) overlapped with peak y(j). If angle_of_tag_density = "y", then row i and column j table cell denotes a read enrichment value within the peak y(j) overlapped with peak y(x).There are five read enrichment values to be selected using tag_density_value: median, mean, SD (standard deviation), quartile_25 (first quartile) and quartile_75 (third quartile).

Lastly, by using return_external_source = TRUE and angle_of_external_source = "x" in function
intersectPeakMatrixResult, user can obtain an external signal value matrix table. Row i and column j table cell denotes an external signal value within the peak x(i) overlapped with peak y(j). If angle_of_external_source = "y", then row i and column j table cell denotes an external signal value within the peak y(j) overlapped with peak y(x).There are five external signal values to be selected using external_source_value: median, mean, SD (standard deviation), quartile_25 (first quartile) and quartile_75 (third quartile).

In order to make cofactor analysis more straightforward, a function cofactorReport has been exclusively designed. Just by simply inputting the output of intersectPeakMatrix into the function, cofactor report will be saved as a PDF file for each peak x. In the report, top cofactors of peak x will be reported along with motif sequences, DNA methylation within motif and 200bp peak regions as well as read enrichment scores (median, first quartile and third quartile).

# get the intersection matrix for K562 common peaks
IM_K562_common_result <- intersectPeakMatrixResult(intersectPeakMatrix = intersectMatrix_common, 
                                                   return_intersection_matrix = TRUE, 
                                                   angle_of_matrix = "x", 
                                                   return_methylation_profile = TRUE, 
                                                   angle_of_methylation_profile = "x",
                                                   return_tag_density = TRUE,
                                                   angle_of_tag_density = "x",
                                                   tag_density_value = "median")
#> Start getting the results of intersectPeakMatrix ...
#> ... ... You chose to return intersection matrix;
#> ... ... ... You chose x-wise intersection matrix;
#> ... ... You chose to return tag density;
#> ... ... ... the tag density value you chose to return is median
#> ... ... ... You chose to return tag density for peak list x;
#> ... ... You chose to return methylation profile;
#> ... ... ... You chose to return methylation profile for peak list x;
#> ... ... You chose NOT to save MethMotif logo in PDF if any;

names(IM_K562_common_result)
#> [1] "intersection_matrix"        "tag_density_matrix"         "methylation_profile_matrix"

# return intersection matrix table for K562 CEBPB shared peaks
IM_K562_common_intersect_matrix <- IM_K562_common_result$intersection_matrix
dim(IM_K562_common_intersect_matrix)
#> [1]   1 131 # 1 row represents the 1 input peak x, 131 columns represent the 131 peak y

# e.g. 1.111111%, 3.333333% and  1.111111% of K562 common peaks overlapped with MM1_HSA_K562_AFF1, MM1_HSA_K562_ARID2 and MM1_HSA_K562_ARID3A respectively.
IM_K562_common_intersect_matrix[1,1:3]
#>                    MM1_HSA_K562_AFF1 MM1_HSA_K562_ARID2 MM1_HSA_K562_ARID3A
#> MM1_HSA_K562_CEBPB          1.111111           3.333333            1.111111

# find the top 10 TFs co-binding in K562 CEBPB shared peaks
IM_K562_common_result_t <- as.data.frame(t(IM_K562_common_intersect_matrix))
attach(IM_K562_common_result_t)
IM_K562_common_result_order <- as.data.frame(IM_K562_common_result_t[order(-MM1_HSA_K562_CEBPB),,drop = FALSE])
detach(IM_K562_common_result_t)
head(IM_K562_common_result_order, n = 10)
#>                    MM1_HSA_K562_CEBPB
#> MM1_HSA_K562_CEBPB         100.000000
#> MM1_HSA_K562_CEBPD          36.666667
#> MM1_HSA_K562_CTCF           12.222222
#> MM1_HSA_K562_E4F1           10.000000
#> MM1_HSA_K562_JUND           10.000000
#> MM1_HSA_K562_FOS             8.888889
#> MM1_HSA_K562_JUN             8.888889
#> MM1_HSA_K562_CTCFL           7.777778
#> MM1_HSA_K562_ELF4            7.777778
#> MM1_HSA_K562_ATF4            6.666667

# The highest co-binding factor in K562 CEBPB common peaks is CEBPD.
# Then plot MethMotif logo for the K562 CEBPB common sites intersected with CEBPD peaks
intersectPeakMatrixResult(intersectPeakMatrix = intersectMatrix_common, 
                          save_MethMotif_logo = TRUE, 
                          angle_of_logo = "x", 
                          saving_MethMotif_logo_y_id = c("MM1_HSA_K562_CEBPD"))
#> Start getting the results of intersectPeakMatrix ...
#> ... ... You chose NOT to return intersection matrix;
#> ... ... You chose NOT to return methylation profile;
#> ... ... You chose to save MethMotif logo in PDF if any;
#> ... ... ... You chose x-wise MethMotif logo;
#> ... ... ... You chose entropy logo;
#> ... ... ... You chose to show all methylation levels;
#> Success: a PDF named 'MM1_HSA_K562_CEBPB_overlapped_with_MM1_HSA_K562_CEBPD-logo-entropy.pdf' 
#> has been saved!

# return DNA methylation matrix table for K562 CEBPB shared peaks
IM_K562_common_methylation <- IM_K562_common_result$methylation_profile_matrix
dim(IM_K562_common_methylation)
#> [1]   1 131 # 1 row represents the 1 input peak x, 131 columns represent the 131 peak y
rownames(IM_K562_common_methylation)
#> [1] "MM1_HSA_K562_CEBPB"
colnames(IM_K562_common_methylation)[1:5]
#> [1] "MM1_HSA_K562_AFF1"   "MM1_HSA_K562_ARID2"  "MM1_HSA_K562_ARID3A"
#> [4] "MM1_HSA_K562_ATF1"   "MM1_HSA_K562_ATF2"  

# the methylation level in the intersected regions between CEBPB and CEBPBD in K562
# each cell in the matrix is a list()
IM_K562_common_methylation["MM1_HSA_K562_CEBPB","MM1_HSA_K562_CEBPD"][[1]]
#>         CpG_num
#> 0-10%        64 #64 CpG methylation scores are less than 0.1 (homogeneously unmethylated) in +/-100bp window around intersected peak summits
#> 10-20%        8
#> 20-30%        7
#> 30-40%        2
#> 40-50%        1
#> 50-60%        1
#> 60-70%        8
#> 70-80%        3
#> 80-90%       10
#> 90-100%      16 #16 CpG methylation scores are more than 0.9 (homogeneously methylated) in +/-100bp window around intersected peak summits


# return read enrichment matrix table for K562 CEBPB shared peaks
IM_K562_common_read_enrichment <- IM_K562_common_result$tag_density_matrix
IM_K562_common_read_enrichment[,seq(1,3,1)]
#>                    MM1_HSA_K562_AFF1 MM1_HSA_K562_ARID2 MM1_HSA_K562_ARID3A
#> MM1_HSA_K562_CEBPB          54.27878           34.56091            17.68113
# the read ernichment median value in the K562 CEBPB peaks overlapped with AFF1 is 54.27878

# Simply using cofactorReport(), cofactors for each peak x will be automatically reported
cofactorReport(intersectPeakMatrix = intersectMatrix_common)
#> Start cofactorReport ...
#> ... The maximum number of cofactors to be reported is 10
#> ... The minimum percent of co-binding peaks for a cofactor is 5%
#> ... Each peak set derived from TFregulomeR compendium in 'peak list x' will be reported in an individual PDF file
#> ... ... Start reporting peak id 'MM1_HSA_K562_CEBPB' ...
#> ... ... ... The number of cofactors passing 'cobinding_threshold' for peak id 'MM1_HSA_K562_CEBPB' is 15. Only top 10 will be selected.
#> ... ... ... Cofactor report for id 'MM1_HSA_K562_CEBPB' has been saved as MM1_HSA_K562_CEBPB_cofactor_report.pdf

# get the intersection matrix for K562 exclusive peaks
IM_K562_exclusive_result <- intersectPeakMatrixResult(intersectPeakMatrix = intersectMatrix_exclusive, 
                                                      return_intersection_matrix = TRUE, 
                                                      angle_of_matrix = "x", 
                                                      return_methylation_profile = TRUE, 
                                                      angle_of_methylation_profile = "x",
                                                      return_tag_density = TRUE,
                                                      angle_of_tag_density = "x",
                                                      tag_density_value = "median")
#> Start getting the results of intersectPeakMatrix ...
#> ... ... You chose to return intersection matrix;
#> ... ... ... You chose x-wise intersection matrix;
#> ... ... You chose to return tag density;
#> ... ... ... the tag density value you chose to return is median
#> ... ... ... You chose to return tag density for peak list x;
#> ... ... You chose to return methylation profile;
#> ... ... ... You chose to return methylation profile for peak list x;
#> ... ... You chose NOT to save MethMotif logo in PDF if any;

names(IM_K562_exclusive_result)
#> [1] "intersection_matrix"        "tag_density_matrix"         "methylation_profile_matrix"

# return intersection matrix table for K562 CEBPB exclusive peaks
IM_K562_exclusive_intersect_matrix <- IM_K562_exclusive_result$intersection_matrix
# e.g. 0.4160888%, 0% and  0.554785% of K562 exclusive peaks overlapped with MM1_HSA_K562_AFF1, MM1_HSA_K562_ARID2 and MM1_HSA_K562_ARID3A respectively.
IM_K562_exclusive_intersect_matrix[1,1:3]
#>                    MM1_HSA_K562_AFF1 MM1_HSA_K562_ARID2 MM1_HSA_K562_ARID3A
#> MM1_HSA_K562_CEBPB         0.4160888                  0            0.554785

# find the top 10 TFs co-binding in K562 exclusive results
IM_K562_exclusive_result_t <- as.data.frame(t(IM_K562_exclusive_intersect_matrix))
attach(IM_K562_exclusive_result_t)
IM_K562_exclusive_result_order <- as.data.frame(IM_K562_exclusive_result_t[order(-MM1_HSA_K562_CEBPB),,drop = FALSE])
detach(IM_K562_exclusive_result_t)
head(IM_K562_exclusive_result_order, n = 10)
#>                      MM1_HSA_K562_CEBPB
#> MM1_HSA_K562_CEBPB           100.000000
#> MM1_HSA_K562_ATF4             35.367545
#> MM1_HSA_K562_TCF12            17.337032
#> MM1_HSA_K562_CBFA2T3          16.643551
#> MM1_HSA_K562_TAL1             16.366158
#> MM1_HSA_K562_GATA1            11.650485
#> MM1_HSA_K562_FOXM1             9.431345
#> MM1_HSA_K562_ATF7              8.876560
#> MM1_HSA_K562_NFE2              8.599168
#> MM1_HSA_K562_MAFG              7.073509

# The highest co-binding factor in K562 CEBPB exclusive peaks is ATF4.
# Then plot MethMotif logo for the K562 CEBPB exclusive sites intersected with ATF4 peaks
intersectPeakMatrixResult(intersectPeakMatrix = intersectMatrix_exclusive, 
                          save_MethMotif_logo = TRUE, 
                          angle_of_logo = "x", 
                          saving_MethMotif_logo_y_id = c("MM1_HSA_K562_ATF4"))
#> Start getting the results of intersectPeakMatrix ...
#> ... ... You chose NOT to return intersection matrix;
#> ... ... You chose NOT to return methylation profile;
#> ... ... You chose to save MethMotif logo in PDF if any;
#> ... ... ... You chose x-wise MethMotif logo;
#> ... ... ... You chose entropy logo;
#> ... ... ... You chose to show all methylation levels;
#> Success: a PDF named 'MM1_HSA_K562_CEBPB_overlapped_with_MM1_HSA_K562_ATF4-logo-entropy.pdf' 
#> has been saved!

# the methylation levels in the intersected regions between CEBPB and ATF4
IM_K562_exclusive_methylation <- IM_K562_exclusive_result$methylation_profile_matrix
IM_K562_exclusive_methylation["MM1_HSA_K562_CEBPB","MM1_HSA_K562_ATF4"][[1]]
#>         CpG_num
#> 0-10%       303
#> 10-20%       44
#> 20-30%       22
#> 30-40%       19
#> 40-50%       12
#> 50-60%       11
#> 60-70%        8
#> 70-80%       14
#> 80-90%       12
#> 90-100%       8

# Simply using cofactorReport(), cofactors for each peak x will be automatically reported
cofactorReport(intersectPeakMatrix = intersectMatrix_exclusive)
#> Start cofactorReport ...
#> ... The maximum number of cofactors to be reported is 10
#> ... The minimum percent of co-binding peaks for a cofactor is 5%
#> ... Each peak set derived from TFregulomeR compendium in 'peak list x' will be reported in an individual PDF file
#> ... ... Start reporting peak id 'MM1_HSA_K562_CEBPB' ...
#> ... ... ... The number of cofactors passing 'cobinding_threshold' for peak id 'MM1_HSA_K562_CEBPB' is 13. Only top 10 will be selected.
#> ... ... ... Cofactor report for id 'MM1_HSA_K562_CEBPB' has been saved as MM1_HSA_K562_CEBPB_cofactor_report.pdf
Figure 8. MethMotif logo of K562 CEBPB common peaks intersected with K562 CEBPD peaks

Figure 8. MethMotif logo of K562 CEBPB common peaks intersected with K562 CEBPD peaks

Figure 9. MethMotif logo of K562 CEBPB exclusive peaks intersected with K562 ATF4 peaks

Figure 9. MethMotif logo of K562 CEBPB exclusive peaks intersected with K562 ATF4 peaks

Figure 10. cofactorReport output

Figure 10. cofactorReport output

Intersected peak matrix extended - TF interactome demo

TF interactome is a global view of TF cooperativity in a cell type. The function intersectPeakMatrix implemented in TFregulomeR is able to do TF interactome analysis using TFregulomeR data compendium. By just simply setting TF x and y list in the input of intersectPeakMatrix as the same list of all TFs in a cell type, pair-wise intersection profiles of every two TFs in the cell type will be charted. If the analysed cell type in TFregulomeR data compendium is originated from MethMotif database, users can even profile DNA methylation landscapes across the TF interactome by choosing methylation_profile_in_narrow_region = TRUE. This strategy is important to identify mCpG-binding TF complexes in a cell type. Moreover, by taking advantage of parameter external_source in intersectPeakMatrix, users are allowed to chart the signal of their own interest across TF interactome. For example, if DNase-seq bedgraph is provided (fourth column is DNase-seq read intensity), chromatin accessibility could be profiled across TF interactome.

In order to facilitate TF interactome analysis and result interpretation, we have built a new function interactome3D in TFregulomeR (>= 1.2.2). The input of this function is the direct output of intersectPeakMatrix, and the output of the function is a dynamic 3D html report showing TF interactome coupled with CpG methylation and/or external signals (if return_interactome_with_mCpG = TRUE and/or return_interactome_with_external_source = TRUE in the function). The analysis processes of this function are: 1) obtain three matrix tables from intersectPeakMatrix output, namely TF intersection, CpG methylation and external signal value matrix tables; 2) both row and column of TF intersection matrix table will undergo hierarchical clustering with euclidean distance to classify those TFs that tend to co-localise with one another; 3) CpG methylation and external signal value matrix tables will be row- and column-wise reordered based on the result of clustered TF intersection matrix table; and 4) TF intersection matrix will be integrated with CpG methylation and external signal value matrix tables separately to form two dynamic 3D graphs.

In the dynamic 3D graph, x and y axis show TFs in a cell type, the color shade denotes the percentage of TFx peaks overlapped with TFy, and z axis show the percentage of mCpGs (the beta score threshold can be modified via mCpG_threshold in interactome3D) or external signal value (modified via external_source_value in interactome3D). Particularly, if z axis value is -1, it means: 1) the overlapped peak regions between TFx and TFy is zero; or 2) CpG methylation information or external signal value is not available (methylation_profile_in_narrow_region = FALSE or analysed cell type not from MethMotif or external signal not provided during intersect matrix analysis).

In the following small demo, we profile TF interactome among CEBPA, CEBPB, CEBPD, and CEBPG coupled with CpG methylation and chromatin accessibility in HepG2.

# we generate DNase-seq bedgraph in HepG2 (fourth column is normalised read intensity). In order to speed up analysis, we didnt use DNase-seq bedgraph at whole genome levels; instead, we only selected those regions overlapped by all TF cistrome in HepG2.
DNase <- read.table("HepG2_DNase-seq-UW_in_HepG2_cistrome.txt")

HepG2_CEBP_id <- c("MM1_HSA_HepG2_CEBPA","MM1_HSA_HepG2_CEBPB",
                   "MM1_HSA_HepG2_CEBPD","MM1_HSA_HepG2_CEBPG")

HepG2_CEBP_interactome <- intersectPeakMatrix(peak_id_x = HepG2_CEBP_id,
                                              motif_only_for_id_x = TRUE,
                                              peak_id_y = HepG2_CEBP_id,
                                              motif_only_for_id_y = TRUE, 
                                              methylation_profile_in_narrow_region = TRUE,
                                              external_source = DNase)
# Simply use interactome3D to generate TF interactome report
interactome3D(intersectPeakMatrix = HepG2_CEBP_interactome,
              return_interactome_with_mCpG = TRUE, 
              return_interactome_with_external_source = TRUE)
#> Start interactome3D ...
#> ... You chose to report TF interactome coupled with DNA methylation ...
#> ... ... The mCpG threshold you chose is 0.8
#> ... You chose to report TF interactome coupled with external source signal ...
#> ... ... The external source signal value you chose is median
#> ... report of TF interactome with mCpG portion has been saved as 'TF_interactome_with_mCpG.html'
#> ... report of TF interactome with external source signal has been saved as 'TF_interactome_with_external_source.html'

Export motif PFM and beta score matrix

As aforementioned, exportMMPFM is not only designed for searchMotif outputs, but also compatible with the outputs from commonPeaks, exclusivePeaks, intersectPeakMatrix. Here, we want to export the motif PFMs and beta score matrices for K562 CEBPB common and exclusive peaks, as well as the common peaks intersected with CEBPD peaks and the exclusive peaks intersected with ATF4 peaks.

# export motif PFM and beta score matrix for K562 CEBPB common peaks
exportMMPFM(fun_output = commonPeak_output, 
            fun = "commonPeaks", 
            save_motif_PFM = TRUE, 
            save_betaScore_matrix = TRUE)
#> Start exporting ... ...
#> ... ... You chose to save motif PFM and beta score matrix.
#> ... ... export commonPeaks...
#> ... ... ... export id = MM1_HSA_K562_CEBPB_common_peaks
#> ... ... ... ... Beta score matrix has been saved as 'MM1_HSA_K562_CEBPB_common_peaks-methScore.txt'.
#> ... ... ... ... Motif PFM has been saved as 'MM1_HSA_K562_CEBPB_common_peaks-motif-MEME.txt'.
#> ... ... ... export id = HCT116_CEBPB_common_peaks
#> ... ... ... the original peaks of HCT116_CEBPB_common_peaks is not loaded from TFregulomeR, 
#> or in the common peak the number of TFBS is zero. Hence no further action for this id!

# export motif PFM and beta score matrix for K562 CEBPB exclusive peaks
exportMMPFM(fun_output = exclusivePeak_output, 
            fun = "exclusivePeaks", 
            save_motif_PFM = TRUE, 
            save_betaScore_matrix = TRUE)
#> Start exporting ... ...
#> ... ... You chose to save motif PFM and beta score matrix.
#> ... ... export exclusivePeaks...
#> ... ... ... export id = MM1_HSA_K562_CEBPB_exclusive_peaks
#> ... ... ... ... Beta score matrix has been saved as 'MM1_HSA_K562_CEBPB_exclusive_peaks-methScore.txt'.
#> ... ... ... ... Motif PFM has been saved as 'MM1_HSA_K562_CEBPB_exclusive_peaks-motif-MEME.txt'.

# export motif PFM and beta score matrix for K562 CEBPB common peaks intersected with K562 CEBPD peaks
exportMMPFM(fun_output = intersectMatrix_common, 
            fun = "intersectPeakMatrix", 
            save_motif_PFM = TRUE, 
            save_betaScore_matrix = TRUE, 
            angle_of_matrix_for_intersectPeakMatrix = "x", 
            saving_id_y_for_intersectPeakMatrix = c("MM1_HSA_K562_CEBPD"))
#> Start exporting ... ...
#> ... ... You chose to save motif PFM and beta score matrix.
#> ... ... export intersectPeakMatrix...
#> ... ... we will export in the x wide of intersectPeakMatrix output since the input angle_of_matrix_for_intersectPeakMatrix = 'x'
#> ... ... ... export id = MM1_HSA_K562_CEBPB
#> ... ... ... ... Beta score matrix has been saved as 'MM1_HSA_K562_CEBPB_overlapped_with_MM1_HSA_K562_CEBPD-methScore.txt'.
#> ... ... ... ... Motif PFM has been saved as 'MM1_HSA_K562_CEBPB_overlapped_with_MM1_HSA_K562_CEBPD-motif-MEME.txt'.

# export motif PFM and beta score matrix for K562 CEBPB exclusive peaks intersected with K562 ATF4 peaks
exportMMPFM(fun_output = intersectMatrix_exclusive, 
            fun = "intersectPeakMatrix", 
            save_motif_PFM = TRUE, 
            save_betaScore_matrix = TRUE, 
            angle_of_matrix_for_intersectPeakMatrix = "x", 
            saving_id_y_for_intersectPeakMatrix = c("MM1_HSA_K562_ATF4"))
#> Start exporting ... ...
#> ... ... You chose to save motif PFM and beta score matrix.
#> ... ... export intersectPeakMatrix...
#> ... ... we will export in the x wide of intersectPeakMatrix output since the input angle_of_matrix_for_intersectPeakMatrix = 'x'
#> ... ... ... export id = MM1_HSA_K562_CEBPB
#> ... ... ... ... Beta score matrix has been saved as 'MM1_HSA_K562_CEBPB_overlapped_with_MM1_HSA_K562_ATF4-methScore.txt'.
#> ... ... ... ... Motif PFM has been saved as 'MM1_HSA_K562_CEBPB_overlapped_with_MM1_HSA_K562_ATF4-motif-MEME.txt'.

Motif distribution

TFregulomeR also allows users to plot the distributions of a given TFBS from the TFregulomeR warehouse in a list of peak sets, using motifDistrib and plotDistrib sequentially. By providing the TFregulomeR ID in id as the input of the motifDistrib, motifDistrib will calculate the occurrences the TFBSs in the given list of peak sets input in peak_list. The unique IDs corresponding to peak_list is required to be provided at the same time using peak_id. If a peak set is derived from TFregulomeR compendium, the TFregulomeR ID should be provided correspondingly; if it is self provided, you can name it with a unique ID yourself.

It should be noted that even though the loaded peak regions from the TFregulomeR compendium are the peak summits, you don’t need to expand the regions. Once you tell motifDistrib the peak set is a TFregulomeR TF subset by providing TFregulomeR ID in the peak_id, it will automatically operate on the peaks itself.

The output of motifDistrib is the input of plotDistrib. In each motif distribution plot, x-axis is the relative distance (bp) to the peak center, while y-axis is the percentage of the TFBS at the position.

Here, we show the distributions of K562 CEBPB motif in K562 CEBPB exclusive peaks and our own peaks locally loaded previously.

# loading my peaks
my_peak_path <- system.file("extdata", "HCT116_CEBPb_binding_sites.txt", package = "TFregulomeR")
my_peak <- read.delim(my_peak_path, sep = "\t", header = FALSE)

motifDistrib_output <- motifDistrib(id = "MM1_HSA_K562_CEBPB", 
                                    peak_list = list(K562_CEBPB_exclusive_peak,
                                                     my_peak),
                                    peak_id = c("MM1_HSA_K562_CEBPB","my_peak"))
#> motifDistrib starts analysing for MethMotif ID = MM1_HSA_K562_CEBPB
#> ... ... analysing peak set MM1_HSA_K562_CEBPB
#> ... ... analysing peak set my_peak
plotDistrib(motifDistrib = motifDistrib_output)
#> Distribution of motif MM1_HSA_K562_CEBPB in peak set MM1_HSA_K562_CEBPB has been saved!
#> Distribution of motif MM1_HSA_K562_CEBPB in peak set my_peak has been saved!
Figure 11. Motif distribution

Figure 11. Motif distribution

Annotate TFBS locations

TFregulomeR is able to annotate TFBS genomic locations using genomeAnnotate. The annotation process is following the order: promoter, TTS, exon, 5’ UTR exon, 3’ UTR exon, intron and intergenic region. Specifically, promoter is defined as the range from 1000bp upstream of TSS to 100bp downstream of TSS, and TTS is defined as the range from 100bp upstream of TTS to 1000bp downstream of TTS. Users can change the parameters using promoter_range and TTS_range respectively. The annotation output of genomeAnnotate is intuitive, not only will a data.frame containing annotation results be returned, but also an HTML report will be saved.

# annotate the locations of K562 CEBPB exclusive peaks
# loading UCSC knownGene
library(TxDb.Hsapiens.UCSC.hg38.knownGene)

K562_CEBPB_exclusivePeak_loc <- genomeAnnotate(peaks = K562_CEBPB_exclusive_peak, 
                                               return_annotation = TRUE, 
                                               return_html_report = TRUE)
#> Start genomeAnnotate ...
#> ... ... You chose to return annotated results in a data.frame.
#> ... ... You chose to return an HTML report.
#> ... ... annotating promoters defined as upstream 1000bp and downstream 100bp
#> ... ... annotating TTS defined as upstream 100bp and downstream 1000bp
#> ... ... annotating exons
#> ... ... annotating 5' UTR
#> ... ... annotating 3' UTR
#> ... ... annotating introns
#> ... ... annotating intergenic regions
#> ... ... An html report has been generated as 'genomeAnnotate_result.html'!
#> ... ... The annotation results have been returned in a data.frame!

head(K562_CEBPB_exclusivePeak_loc)
#>    chr     start       end                      id   annotation geneName
#> 1 chr3 133629333 133629333  genomeAnnotate_peak_68 promoter-TSS   TOPBP1
#> 2 chr1 155858041 155858041 genomeAnnotate_peak_255 promoter-TSS    GON4L
#> 3 chr3 193662882 193662882 genomeAnnotate_peak_280 promoter-TSS     OPA1
#> 4 chr4  72038972  72038972 genomeAnnotate_peak_426 promoter-TSS   NPFFR2
#> 5 chr4  74448009  74448009 genomeAnnotate_peak_447 promoter-TSS     AREG
#> 6 chr4 143905637 143905637 genomeAnnotate_peak_635 promoter-TSS     GYPE
#>                                                                                                    transcript
#> 1                                                                         ENST00000513818.1;ENST00000506779.1
#> 2 ENST00000368331.5;ENST00000271883.9;ENST00000620426.4;ENST00000361040.9;ENST00000471341.5;ENST00000622608.1
#> 3                                                                                           ENST00000445863.1
#> 4                                                                         ENST00000395999.5;ENST00000358749.3
#> 5                                                                                           ENST00000511560.1
#> 6                                                       ENST00000358615.8;ENST00000506264.5;ENST00000437468.2
#>             distanceToTSS
#> 1                 941;602
#> 2 745;745;745;836;823;745
#> 3                       8
#> 4                 205;401
#> 5                     629
#> 6               77;117;73

TFregulomeR - genomeAnnotate Result



Table - genomic annotations of the peaks:
chr start end annotation geneName transcript distanceToTSS
chr3 133629333 133629333 promoter-TSS TOPBP1 uc062nzf.1;uc062nzg.1 941;602
chr1 155858041 155858041 promoter-TSS GON4L uc001flz.4;uc057lsw.1;uc057lsx.1;uc001fmc.5;uc057lta.1;uc057ltb.1 745;745;745;836;823;745
chr3 193662882 193662882 promoter-TSS OPA1 uc062rmv.1 8
chr4 72038972 72038972 promoter-TSS NPFFR2 uc003hgi.3;uc003hgh.3 205;401
chr4 74448009 74448009 promoter-TSS AREG uc062xhs.1 629
chr4 143905637 143905637 promoter-TSS GYPE uc003ijj.4;uc062zwx.1;uc003ijk.5 77;117;73
chr4 144019438 144019438 promoter-TSS GYPB uc062zxb.1;uc003ijm.2;uc062zxc.1;uc062zxd.1;uc062zxe.1;uc011chw.2;uc011chx.2;uc062zxf.1;uc062zxg.1;uc062zxh.1;uc062zxi.1;uc062zxj.1;uc062zxk.1;uc062zxl.1;uc062zxm.1 98;98;100;113;113;113;113;113;113;113;113;135;100;92;92
chr4 144140785 144140785 promoter-TSS GYPA uc003ijo.5;uc011cib.3;uc062zxp.1;uc062zxq.1;uc010ioq.4;uc003ijp.5;uc010ior.4;uc062zxr.1;uc062zxs.1;uc062zxt.1 33;91;115;112;149;149;149;149;91;91
chr5 34526833 34526833 promoter-TSS uc063cvq.1 531
chr5 65654961 65654961 promoter-TSS TRAPPC13 uc063eaw.1 672
chr5 154835258 154835258 promoter-TSS FAXDC2 uc063jab.1;uc063jac.1;uc063jaj.1 303;169;15
chr1 202441817 202441817 promoter-TSS PPP1R12B uc057ome.2 708
chr7 151969726 151969726 promoter-TSS GALNTL5 uc064jlx.1 888
chr8 65790050 65790050 promoter-TSS PDE7A uc003xvp.4 965
chr8 67013338 67013338 promoter-TSS PPP1R42 uc064nlb.1;uc064nle.1 356;332
chr8 141340554 141340554 promoter-TSS LINC01300 uc003ywe.3 4
chr9 97232341 97232341 promoter-TSS AL590705.2 uc064upp.1 859
chr9 131078469 131078469 promoter-TSS LAMC3 uc064wol.1 773
chr1 246306156 246306156 promoter-TSS SMYD3 uc057qyk.1 146
chr10 1159665 1159665 promoter-TSS LINC00200 uc010qag.1;uc057rfe.1 102;138
chr11 6563474 6563474 promoter-TSS DNHD1 uc057ymf.1;uc057ymg.1 602;962
chr11 13924915 13924915 promoter-TSS AC022240.1 uc057zhm.1 51
chr11 60056606 60056606 promoter-TSS MS4A3 uc001noo.4;uc058bso.1;uc001non.4;uc001nom.4;uc058bsp.1 18;21;21;76;96
chr11 71705585 71705585 promoter-TSS AP003498.1 uc058evw.1 180
chr11 71706193 71706193 promoter-TSS AP003498.1 uc058evw.1 788
chr1 31945341 31945341 promoter-TSS PTP4A2 uc057eht.1 484
chr11 116776684 116776684 promoter-TSS ZPR1 uc058hqk.1 807
chr11 119108733 119108733 promoter-TSS DPAGT1 uc031yhj.1 401
chr12 57767233 57767233 promoter-TSS CYP27B1 uc058qcl.1;uc001spz.2;uc058qco.1;uc058qcp.1 137;17;721;776
chr14 72724354 72724354 promoter-TSS DPF3 uc001xnd.3;uc059czs.1 248;582
chr15 51603818 51603818 promoter-TSS DMXL2 uc059jgn.1 116
chr16 153259 153259 promoter-TSS HBM uc059ogw.1 632
chr19 45263801 45263801 promoter-TSS MARK4 uc060zxf.1 885
chr1 94902064 94902064 promoter-TSS CNN3 uc057imh.1 190
chr22 27990887 27990887 promoter-TSS TTC28 uc062cto.1 810
chr3 9978368 9978368 promoter-TSS EMC3 uc062gqg.1 70
chr3 57661617 57661617 promoter-TSS DENND6A uc062kxv.1 790

Annotate TFBS functions

The key function of transcription factors is to regulate gene expression. By working with Genomic Regions Enrichment of Annotations Tool (GREAT), TFregulomeR allows users to annotate the functions of TFBSs using greatAnnotate. Given that GREAT server doesn’t support hg38, liftOver R package has been incorporated in TFregulomeR to convert hg38 to hg19. However, since v1.16.1, rGREAT has linked to GREAT version 4.0.4, which supports hg38. Hence, TFregulomeR has accordingly enabled a direct annotation for genomic regions in hg38 from development version v1.2.1 on (This feature will be further added in the future stable version v1.3.0). The annotation output of greatAnnotate is intuitive, not only will a data.frame containing annotation results be returned, but also an HTML report will be saved. The HTML report takes advantage of rbokeh package, which presents a vivid and dynamic interface.

# annotate the functions of K562 CEBPB exclusive peaks
# loading GREAT R package 'rGREAT'
library(rGREAT)
# the peak assembly is "hg38", and 'liftOver' is needed for conversion (no need if using TFregulomeR >= v1.2.1 and rGREAT >= 1.16.1)
library(liftOver)
# 'rbokeh' is required for an HTML report generation
library(rbokeh)

K562_CEBPB_exclusivePeak_func <- greatAnnotate(peaks = K562_CEBPB_exclusive_peak, 
                                               return_annotation = TRUE, 
                                               return_html_report = TRUE)
#> Start greatAnnotate ...
#> ... ... You chose to return annotated results in a data.frame.
#> ... ... You chose to return an HTML report.
#> ... ... assembly is hg38. Now converting to hg19 using liftOver...
#> ... ... number of the original input regions is 721
#> ... ... number of the regions successfully converted to hg19 is 721
#> ... ... start GREAT analysis
#> ... ... An html report has been generated as 'greatAnnotate_result.html'!
#> ... ... The annotation results have been returned in a data.frame!

head(K562_CEBPB_exclusivePeak_func)
#>   category         ID                             name
#> 1       MF GO:0005488                          binding
#> 2       MF GO:0005515                  protein binding
#> 3       BP GO:0065007            biological regulation
#> 4       BP GO:0050794   regulation of cellular process
#> 5       BP GO:0019222  regulation of metabolic process
#> 6       BP GO:0050789 regulation of biological process
#>   number_of_targeting_genes adjusted_pvalue
#> 1                       817    0.0001415278
#> 2                       497    0.0007221064
#> 3                       653    0.0059313854
#> 4                       602    0.0059313854
#> 5                       386    0.0072758686
#> 6                       628    0.0072758686

TFregulomeR - GreatAnnotate Result



Figure - Gene ontology analyses of targeting genes:

Table - Gene ontology analyses of targeting genes:
GO ID GO name adjusted p-value number of targeting genes
GO:0065007 biological regulation 0.0059313853615 653
GO:0050794 regulation of cellular process 0.0059313853615 602
GO:0019222 regulation of metabolic process 0.0072758685718 386
GO:0050789 regulation of biological process 0.0072758685718 628
GO:0015669 gas transport 0.0072758685718 5
GO:0080090 regulation of primary metabolic process 0.00896198756116667 351


Connect with TFBSTools

TFregulomeR is not working alone. We have built a function allowing conversion of the motif matrix in TFregulomeR warehouse to the subclass PFMatrix in TFBSTools, using toTFBSTools.

library(TFBSTools)
K562_CEBPB_TFBS_PFM <- toTFBSTools(id = "MM1_HSA_K562_CEBPB")
K562_CEBPB_TFBS_PFM
#> An object of class PFMatrix
#> ID: MM1_HSA_K562_CEBPB
#> Name: CEBPB
#> Matrix Class: Unknown
#> strand: *
#> Tags: 
#> list()
#> Background: 
#>      A      C      G      T 
#> 0.2717 0.2283 0.2283 0.2717 
#> Matrix: 
#>   [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
#> A   93  305    0    0   49    0  280   91  290   533    15
#> C   61   74    0    0    0  533   59  205  242     0   215
#> G  192  141    0    0  390    0  139    3    0     0    73
#> T  187   13  533  533   94    0   55  234    1     0   230