Cis-regulatory Element Annotation System

Get started with CEAS

CEAS Overview

Figure 1 The work-flow of CEAS. A gene annotation table file, a BED file with ChIP regions, and a WIG file with ChIP enrichment signal are required. CEAS consists of three modules: ChIP region annotation, gene-centered annotation, and average profiling within/near important genomic features. As output, CEAS produces an R script of graphical results (or a PDF file if R can be directly called in the same environment as Python) and a tab-delimited with XLS extension of gene-centered annotation.

Inputs

CEAS needs three major input files: Gene annotation table
We provide RefSeq tables in sqlite3 files for several genomes (ce4 and ce6 for worm, dm2 and dm3 for fly, mm8 and mm9 for mouse, hg18 and hg19 for human). (e.g. UCSC known genes), they can download it from UCSC and convert it to a sqlite3 file using the `build_genomeBG' script before CEAS. This script, `build_genomeBG' also performs genome background annotation with the gene annotation table and the user's input WIG file and compiles it in the sqlite3 file. Therefore, if the input WIG file does not cover cover the entire tiling or mappable genomic regions (e.g. the sequencing depth is not deep enough in case of ChIP-Seq), the genome background annotation may not be accurate because a lot of regions supposed to be considered are included by the WIG file.

BED file with ChIP regions
The BED file are required to contain the chromosomes, start, and end locations of ChIP regions identified by a peak-caller (e.g. MAT for affy ChIP-chip, MA2C for Nimblegen ChIP-chip, MACS for ChIP-Seq). Other fields such as name or score are optional. An example BED file is given below. The ChIP regions do not have to be sorted in advance because CEAS sorts the regions by default.

e.g.)
chr1779600780954
chr1874824876507
chr111477451148979
chr115762701576887
chr123250392326135
chr124294362430692

WIG file with ChIP enrichment signal
The WIG file contains a continuous ChIP enrichment signal whereas the BED file shows discrete ChIP regions. Currently, CEAS supports fixedStep and variableStep WIG formats See the following examples of fixedStep and variableStep WIG files.

e.g.) fixedStep WIG
track type=wiggle_0
fixedStep chrom=chr1 start=6131 step=10
3
4
4
4
4
4
4
1
0
1
3

e.g.) variableStep WIG
track type=wiggle_0
variableStep chrom=chr1 span=1
6131 3
6141 4
6151 4
6161 4
6171 4
6181 4
6191 4
6201 1
6211 0
6221 1
6231 3

Modules

ChIP region annotation
CEAS estimates the relative enrichment level of ChIP regions in each gene feature with respect to the whole genome. For this, it first calculates the percentages of the ChIP regions that reside in the following four categories: (a) promoters, (b) bidirectional promoters, (c) downstreams of a gene, and (d) gene bodies (3'UTRs, 5'UTRs, coding exons, and introns). In addition to these categories, the user can add another extra category of interest such as non-coding regions as an optional input parameter. 'Promoters' correspond to the upstream regions of the transcription start site (TSS) of genes. Three promoter sizes can be specified by the user (1kb, 2kb, and 3kb by default). For instance, if the user set promoter to be the 1kb, 3kb, and 10kb upstream of TSS, CEAS computes the cumulative percentages of ChIP regions that fall in <=1kb, <=3kb, and <=10kb upstream of the TSS of genes, respectively. `Bidirectional promoters' are promoter regions between divergently transcribed genes whose TSS are closer in proximity than user-defined distances (two options, 2.5kb and 5kb by default). `Downstreams' refer to the regions immediately downstream of genes, spanning up to the same search range as in `promoters' from the transcription termination site (TTS). `Gene bodies' are further categorized into UTR regions (3' and 5' UTRs), coding exons and introns. After the percentages of ChIP regions respective categories, P-values for the significance of the relative enrichment with respect to the background are calculated using one-sided binomial test.
As a final summary of ChIP region annotation, CEAS draws a pie chart of how ChIP regions spread over the categories. If ChIP regions do not fall into any of the categories, they are considered to be `distal intergenic'.

Gene-centered annotation
Identifying genes associated with ChIP regions is important to infer the direct regulatory gene targets of the binding factor of interest. CEAS provides the distances to the centers of the nearest ChIP regions upstream and downstream of every RefSeq gene's TSS, allowing biologists to determine the potential target genes of the binding factor based on the distances. Moreover, in case that a broad ChIP peak covers the whole or part of a gene body, it is useful to know how much of the gene, including its promoter or downstream, is occupied by the ChIP region in addition to how far the TSS of the gene is from the ChIP region center. To this end, CEAS divides every gene into three equal fractions and, for each fraction, calculates the percentage of the area covered by ChIP regions. It also estimates the percentages of the promoter and downstream of the gene (3kb upstream of TSS and downstream of TTS by default) that are covered by ChIP regions. The results are saved as a tab-delimited text file with XLS extension for easy Excel visualization, which contains a row of annotations for every RefSeq gene.

Average signal profiling within/near important genomic features
Since ChIP region and gene-centered annotation operate on discrete ChIP regions identified by a peak-calling algorithm, some subtle binding patterns may fail to be captured, depending on the cut-off of peak calling. Therefore, CEAS displays continuous ChIP enrichment signal within/around important gene features to help biologists visualize the average binding patterns. It draws the average signals around TSS and TTS in a user-defined range (+/- 3kb from TSS and TTS by default). In addition, CEAS gives average signals on meta-gene, meta-concatenated-exon, meta-concatenated-intron, meta-exons and meta-intron, where the prefix `meta' indicates that every element (e.g every gene) is normalized residing within the above categories are obtained, they are compared against the genome background percentages for those to have the same length (e.g., 3kb for meta-gene). The difference between meta-concatenated-exon and meta-exons is that the first concatenates all exons of a gene before calculating the average gene (like a meta-cDNA) profile, whereas the latter calculates the average exon profile of all exons. These plots allow biologists to gain insight on how ChIP enrichment varies over the gene body (or exons and introns). CEAS provides an additional function of drawing the average ChIP signals of multiple sub-groups of genes (e.g. up vs. down genes), allowing the user to compare between the gene groups. In addition, we provide a separate script, named `sitepro', in our CEAS package, which draws the average signal (from a WIG) in a given list of sites of interest (from a BED). This script enables biologists to visualize the average signals in any arbitrary regions (e.g. transcription factor binding sites) in addition to the pre-defined genomic regions.

Outputs

CEAS generates two files as output. The R file contains the script to generate the graphical results of ChIP region annotation and average signal profiling within/near important genomic regions. If R can be directly called in the same environment as Python, CEAS generates an R script and its PDF file. CEAS writes the result of gene-centered annotation in a tab-delimited txt file with XLS extension for easy XLS visualization.

R script
The user can obtain the PDF of graphical results by typing the following on the command line.

$ R --vanilla < file.R

, where file.R stands for the R script produced by CEAS. As a result of running the R script, file.pdf will be generated. This step is included in CEAS if R can be directly called in the same environment as Python. XLS file
CEAS generates a tab-delimited TXT file with XLS extension for gene-centered annotation. See Example use for a detailed description of the XLS file.

Easy run

If the user wants to run CEAS in the default mode, simply type on the command line:

$ ceas -g gdb -b bed -w wig

where gdb, bed, and wig stands for a sqlite3 file with a gene annotation table and genome background annotation, a BED file with ChIP regions, and a WIG file with ChIP enrichment signal file, respectively.

For more sophisticated use of CEAS, see Use CEAS.

Use CEAS

Running modes

The CEAS running mode is determined by input file combination. As can be seen in Figure 1, ChIP region annotation and gene-centered annotation requires a gene annotation table file and a BED file whereas average signal profiling needs a gene annotation table and a WIG file. Therefore, CEAS adpatively operates the modules according to which input files are given. Of course, when all the input files are given, all of the three modules will run.

Assuming that default parameters values are selected for all other parameters,
However, it should be noted that if the user needs to re-do genome background annotation with their own WIG file CEAS requires the user to input a WIG file and set --bg as follows. See Arguments for more details.

$ ceas [options] --bg -g gdb -b bed -w wig

Arguments

Usage: ceas [options] -g gdb (-b bed | -w wig )

The following is a detailed description of the options used to control CEAS.
--versionShow program's version number and exit.
-h, --helpShow this help message and exit.
-b, --bedBED file with ChIP regions.
-w, --wigWIG file for either wig profiling or genome background annotation. WARNING: CEAS accepts fixedStep and variableStep WIG file. The user must set --bgflag for genome background annotation.
-e, --ebedBED file of extra regions of interest (e.g. non-coding regions)
-g, --gdb Gene annotation table file (e.g. a refGene table in sqlite3 db format provided through the CEAS web, http://liulab.dfci.harvard.edu/CEAS/download.html). If the sqlite3 file does not have the genome background annotation, the user must turn on --bg and have an input WIG file.
--nameExperiment name. This will be used to name the output files (R script, PDF file and XLS file). If an experiment name is not given, the stem of the input BED file name will be used instead. (e.g. if BED is peaks.bed, 'peaks' will be used as a name.) If a BED file is not given, the input WIG file name will be used.
--sizesPromoter (also downstream) sizes for ChIP region annotation. Comma-separated three integer numbers or a single number will be accepted. If a single integer is given, it will be segmented into three equal fractions (i.e. 3000 is equivalent to 1000,2000,3000). DEFAULT: 1000,2000,3000. WARNING: numbers > 10000bp are automatically fixed to 10000bp.
--bisizesBidirectional-promoter sizes for ChIP region annotation. The user can choose two numbers to define bidirectional promoters. Comma-separated two values or a single value can be given. If a single value is given, it will be segmented into two equal fractions (i.e. 5000 is equivalent to 2500,5000) DEFAULT: 2500,5000bp. WARNING: numbers > 20000bp are automatically fixed to 20000bp.
--bgRun genome BG annotation. WARNING: this flag is effective only if a WIG file is given through -w (--wig). Otherwise, ignored.
--spanSpan from TSS and TTS in the gene-centered annotation. ChIP regions within this range from TSS and TTS are considered when calculating the coverage rates of promoter and downstream by ChIP regions. DEFAULT=3000bp
--pf-resWig profiling resolution, DEFAULT: 50bp. WARNING: a number smaller than the wig interval (resolution) may cause aliasing error.
--rel-distRelative distance to TSS/TTS in wig profiling. DEFAULT: 3000bp
--gn-groupsGene-groups of particular interest in wig profiling. Each gene group file must have gene names in the 1s column. The file names are separated by commas w/ no space (e.g. --gn-groups=top10.txt,bottom10.txt)
--gn-group-namesThe names of the gene groups in --gn-groups. The gene group names are separated by commas. (e.g. --gn-group-names='top 10%,bottom 10%'). These group names appear in the legends of the wig profiling plots. If no group names given, the groups are represented as 'Group 1, Group2,...Group n'.
--gname2Whether or not use the 'name2' column of the gene annotation table when reading the gene IDs in the files given through --gn-groups. This flag is meaningful only with --gn-groups.

Example use

Here, we included two examples of CEAS runs. The first example is CD4T+ cell H3K36me3 ChIP-Seq data and the second C.elegans SDC-3 ChIP-chip (Nimblegen).

Human CD4T+ cell H3K36me3 ChIP-Seq
An example command line is as follows.

$ ceas --name=H3K36me3_ceas --pf-res=20 --gn-groups=top10.txt,bottom10.txt --gn-group-names='Top 10%,Bottom 10%' -g ./hg18/refGene -b H3K36me3_MACS_pval1e-5_peaks.bed -w H3K36me3.wig

The name of this run is `H3K36me3_ceas' (after --name). Two files of gene groups, top10.txt and bottom10.txt after --gn-groups, contain the lists of RefSeq genes that are highly expressed (top 10%) and lowly expressed (bottom 10%) in the human CD4T+ cell, respectively. The average profiles of ChIP enrichment signal on these two gene groups, which will be referred to as Top 10% and Bottom 10% (after --gn-group-names), will appear along with the average profile over all genes in the average profile plots. In this example, it is assumed that the local sqlite3 db file of hg18 RefSeq table (after -g) is ./hg18/refGene. The option, --bg, forces CEAS to re-run genome background annotation using H3K36me3.wig. However, in this special case, using the local db file is more recommendable because H3K36me3 is believed to be transcription elongation mark (gene body mark) and the WIG file may contain only gene bodies.

When running CEAS, it prints the following summary of argument selection to stdout, which helps the user to trace their parameter selections, particularly in case that the user wants to run CEAS multiple times with different parameter sets.

# PARAMETER LIST:
# name = H3K36me3_ceas
# gene annotation table = hg18/refGene
# BED file = H3K36me3_MACS_pval1e-5_peaks.bed
# WIG file = H3K36me3.wig
# extra BED file = None
# ChIP annotation = True
# gene-centered annotation = True
# average profiling = True
# re-annotation for genome background (ChIP region annotation) = False
# promoter sizes (ChIP region annotation) = 1000,2000,3000 bp
# downstream sizes (ChIP region annotation) = 1000,2000,3000 bp
# bidrectional promoter sizes (ChIP region annotation) = 2500,5000 bp
# span size (gene-centered annotation) = 3000 bp
# profiling resolution (average profiling) = 20 bp
# relative distance wrt TSS and TTS (average profiling) = 3000 bp
# gene groups (average profiling) = CD4T_top10.txt, CD4T_bottom10.txt


- ChIP region annotation and average profiling within/near important genomic features -
Figure 2 The first page shows the distribution of ChIP regions over chromosomes. The blue bars represent the percentages of the whole tiled or mappable regions in the chromosomes (genome background) and the red bars the percentages of the whole ChIP. These percentages are also marked right next to the bars. P-values for the significance of the relative enrichment of ChIP regions with respect to the gnome background are shown in parentheses next to the percentages of the red bars.For example, in the above figure, 9.4 % of ChIP regions reside in chr1 whereas 7.9 % of the whole tiled (or mappable regions) occupy chr1 with a P-value of 1.4e-5. The sum of percentages of the red bars (or blue bars, equivalently) is 100 %.
Figure 3 The second page shows the relative enrichments of ChIP regions in important genomic features, such as promoters, immediate downstreams of genes, and gene bodies, with respect to the genome background. As on the first page, the blue bars represent the percentages of the tiled or mappable regions that are located in such genomic regions (genome background) and the red bars the percentages of ChIP regions. Since H3K36me3 is a transcriptional elongation mark, it shows very high relative enrichment in gene bodies (81.9% of ChIP regions within gene bodies compared to 44.1 % of the genome background) but very low enrichment in promoters including bidrectional ones (only 1.7 % of ChIP regions within 3 kb upstreams of TSS). In addition, it was also observed that H3K36me3 has a long tail after TTS considering its high relative enrichment in the immediate downstream (see the third bar plot).
Figure 4 The third page of the CEAS graphical results illustrates how ChIP regions are distributed over important genomic features. Unlike the bar plots on the second page, the genomic features in the above pie chart are mutually exclusive (no overlaps); thus, the sum of the percentage values is 100 %. `Distal intergenic' represents the percentage of ChIP regions that do not belong in any of other genomic features.
Figure 5 The fourth page visualizes how ChIP regions are distributed over the genome along with their scores or peak heights. The line graph on the top left corner illustrates the distribution of peak heights (or scores). The red bars in the main plot ChIP regions in the input BED file. In this particular example, the maximum peak height is 40 (the largest value on the x-axis of the line). The x-axis of the main plot represents the actual chromosome sizes.
Figure 6 The fifth and sixth pages show the results of average profiling within/near important genomic features. The panels on the first row display the average ChIP enrichment signals around TSS and TTS of genes, respectively. The red, cyan, and black colors indicate the average ChIP enrichment signals of the top 10 % of the expressed genes, bottom 10%, and all RefSeq genes. On the right panel, it is observed that H3K36me3 has a tail after 3' end of genes with a nucleosome depletion at TTS, which also agrees with the observation made in Figure 3 (the bar plot of downstream). The middle panel (on the second row) represents the average ChIP signals on the meta-gene of 3 kb, which shows that H3K36me3 enriches gene bodies and increases towards the 3' end. In addition, CEAS concatenates exons of a gene before calculating the average gene (like meta-cDNA) (the left panel on the bottom row). CEAS produces a similar plot for concatenated introns as well (the right panel on the bottom row).
Figure 7 In addition to the average profiles on concatenated exons and concatenated introns, CEAS shows the average ChIP signals on all of exons and introns as well. Since exon and intron lengths highly vary from gene to gene, CEAS groups exons (or introns) into three classes by length and calculates the respect average profiles in order to avoid any potential graphical artifacts due to length-normalization.
Figure 8 This figure is a screen shot of the XLS file from gene-centered annotation that was process for better visualization. The fields of the XLS file are described below.
FieldDescription
chrChromosome of a RefSeq gene
txStartTranscription starting site (TSS) of a RefSeq gene
txEndTranscription terminating site (TTS) of a RefSeq gene
strandStrand of a RefSeq Gene
dist u txStartDistance to the nearest ChIP region (center) upstream of txStart (bp)
dist d txStartDistance to the nearest ChIP region (center) downstream of txStart (bp)
dist u txEndDistance to the nearest ChIP region (center) upstream of txEnd (bp)
dist d txEndDistance to the nearest ChIP region (center) downstream of txEnd (bp)
3kb u txStartOccupancy rate of ChIP regions in 3kb upstream of txStart (0.0 - 1.0)
3kb d txStartOccupancy rate of ChIP regions in 3kb downstream of txStart (0.0 - 1.0)
1/3 geneOccupancy rate of ChIP regions in the 1st third of a gene (0.0 - 1.0)
2/3 geneOccupancy rate of ChIP regions in the 2nd third of a gene (0.0 - 1.0)
3/3 geneOccupancy rate of ChIP regions in the 3rd third of a gene (0.0 - 1.0)
3kb d txEndOccupancy rate of ChIP regions in 3kb downstream of txEnd (0.0 - 1.0)
exonsOccupancy rate of ChIP regions in the exons (0.0-1.0)

For example, the first RefSeq gene in Figure 8 is located at chr5 180258764:180310512 at the sense strand, and the distance to the nearest ChIP region upstream of its txStart is 49,012 bp and the distance downstream of its txStart is 5 bp. About 12 % (0.121) of the 3,000 bp upstream region of the gene is occupied by ChIP region(s), and the occupancy rate decreases towards the 3' end of the gene (2 % of 1/3 gene, 0 % of the others).

C.elegans SDC3 ChIP-chip (Nimblegen)
Another example of CEAS is C.elegans SDC3 ChIP-chip (Nimblegen).

$ ceas --name=SDC3_ceas --bg --gname2 --pf-res=86 --gn-groups=ce_top10.txt,ce_middle10.txt,ce_bottom10.txt --gn-group-names='Top 10%,Middle 10%,Bottom 10%' -g ./ce4/refGene -b SDC3_MA2C_peaks.bed -w SDC3_MA2Cscore.wig -e nc_regions.bed

In this case, we decided to re-annotate the genome background (--bg) based on SDC3_MA2Cscore.wig file because the input WIG file contains all tiled locations. The option, --gname2, was set since gene IDs instead of RefSeq IDs were used in the gene group files (ce_top10.txt,ce_middle10.txt,ce_bottom10.txt after --gn-groups). It should be noted that a wrong choice of profiling resolution (--pf-res) causes artifacts in average gene profiling. In this example, the WIG file resolution (tiling space) is 86 bp while the default is 50 bp. An extra BED file, nc_regions.bed, is also given after -e (or --ebed) to perform ChIP region annotation on non-coding regions of C. elegans.

The following refers to the parameter selections of this CEAS run.

# PARAMETER LIST:
# name = SDC3_ceas
# gene annotation table = ce4/refGene
# BED file = SDC3_MA2C_peaks.bed
# WIG file = SDC3_MA2Cscore.wig
# extra BED file = nc_regions.bed
# ChIP annotation = True
# gene-centered annotation = True
# average profiling = True
# re-annotation for genome background (ChIP region annotation) = True
# promoter sizes (ChIP region annotation) = 1000,2000,3000 bp
# downstream sizes (ChIP region annotation) = 1000,2000,3000 bp
# bidrectional promoter sizes (ChIP region annotation) = 2500,5000 bp
# span size (gene-centered annotation) = 3000 bp
# profiling resolution (average profiling) = 86 bp
# relative distance wrt TSS and TTS (average profiling) = 3000 bp
# gene groups (average profiling) = ce_top10.txt, ce_middle10.txt, ce_bottom10.txt

Figure 9 SDC-3 is a transcription factor that regulates sex determination and dosage compensation in C. elegans. Therefore, it tends to bind more on X chromosome than other autosomes.
Figure 10 In this run, an extra BED file of non-coding regions was added. The bottom row (Regions of Interest) shows a high relative enrichment level of SDC-3 ChIP regions in the non-coding regions with respect to the genome background (P-value of 7.2e-202).
Figure 11 Also, this pie chart clearly shows that SDC-3 tends to bind to promoters of genes to regulate the genes (about 45 % of ChIP regions are in promoters).
Figure 12 This plot also confirms that SDC-3 has more and strong binding sites on the X chromosome than autosomes.
Figure 13 The red, green, and blue colors represent the average ChIP signal profiles on top 10 % , middle 10 %, and bottom 10 % of expressed genes in C. elegans while the black represents the average profile on all genes. As observed in the prior plots, it is shown that SDC-3 highly enriches promoters.
Figure 14 The above plots illustrate the average ChIP signals over all exons and introns.

Build a sqlite3 file with a gene annotation table and genome background annotation for CEAS

Running 'build_genomeBG'

In addition to using the sqlite3 files provided on our web site, users can build an sqlite3 file with a gene annotation table and genome background using `build_genomeBG.' For this, MySQLdb package must be installed.

$ build_genomeBG [options] -d db -g gt -w wig -o ot

In the above command line, db and gt represent a genome name of UCSC (e.g. ce4) and a gene annotation for the genome, respectively. The Python script, `build_genomeBG' directly connects to the UCSC database and downloads the gene table. The genome background annotation is performed based on the input WIG file (-w wig). This WIG file indicates the regions to consider as genome background (i.e. tiled regions in case of arrays or mappable regions in case of sequencing). The last option -o ot sets the output sqlite3 file name.

For example, suppose that the user wants to build an sqlite3 file with `sangerGene` annotation of `ce4.' The command line will be as follows. It is assumed that the same wig file as in the second CEAS example was also used here.

$ build_genomeBG -d ce4 -g sangerGene -w SDC3_MA2Cscore.wig -o sangerGene

Arguments

--versionShow program's version number and exit.
-h, --helpShow this help message and exit.
-d, --dbGenome of UCSC (e.g. hg18). If -d (--db) is not given, this script searches for a local sqlite3 referenced by -g (--gt). WARNING: MySQLdb must be installed to use the tables of UCSC and if an existing local sqlite3 file is opened, the previous tables will be reset.
-g, --gtName of the gene annotation table (or local sqlite3 file) (e.g. refGene or knownGene). If -d (--db) is given, build_genomeBG will connect to UCSC and download the specified gene table. Otherwise, build_genomeBG search for a local sqlite3 file with the name.
-w, --wigWIG file needed to obtain genome locations in BG annotation. VariableStep and fixedWig files are accepted.
-o, --otOutput sqlite3 db file name. The gene annotation table read from the local sqlite3 file or UCSC DB will be saved in a table named as 'GeneTable' and the computed genome bg annotation will be saved in two tables named as 'GenomeBGS' and 'GenomeBGP. If this option is not given, this script generates a sqlite3 file with the same name as given through -g (--gt). WARNING! When an existing local sqlite3 file is opened and saved as the same name, the tables in the file will be overwritten.
--promoterMaximum promoter size to consider for genome bg annotation. This must be >= 1000bp. Any value less than 1000bp will be set to 1000bp. DEFAULT: 10000bp
--bipromoterMaximum Bidirectional promoter size to consider for genome bg annotation. This must be >= 1000bp. Any value less than 1000bp will be set to 1000bp. DEFAULT: 20000bp
--downstreamMaximum immediate downstream size to consider for genome bg annotation. This must be >= 1000bp. Any value less than 1000bp will be set to 1000bp. DEFAULT: 10000bp
--binsizeBinsize with which to bin promoter, bidirectional promoter, and immediate downstream sizes. In each bin, the percentage of genome will be calculated. DEFAULT=1000bp