NPS (Nucleosome Positioning from Sequencing)

 

Overview

NPS is a Python software package that identifies nucleosome positions from histone-modification ChIP-seq or nucleosome sequencing data at the nucleosome level. NPS builds a continuous waveform that represents the enrichment of histone modifications (or nucleosomes) by extending each tag (25 nt, Solexa) to 150 nt in the 3' direction and taking the middle 75 nt, and then detects nucleosome positions using Laplacian of Gaussian (LoG) edge detection. The p-value of each detection is estimated using a Poisson approximation, and the user can choose a cut-off for the final selection of nucleosome positions. For histone modification data, the sequence tags are regrouped by histone modification type after nucleosome positioning, and the p-value of a particular histone modification at a positioned nucleosome is then calculated from the tag count of that modification in the nucleosome region using a Poisson distribution, similar to the method described above. The user can also choose a p-value cut-off for histone modification assignment.
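The tag extension step described above can be sketched as follows. This is a minimal illustration, not the NPS source code: the function names middle_window and pileup are hypothetical, coordinates are assumed to be 0-based and half-open as in BED, and the middle 75 nt of the 150 nt extended tag is assumed to start 37 nt downstream of the tag's 5' end.

```python
def middle_window(start, end, strand, extension=75, shift=37):
    """Return (lo, hi) of the window kept for one tag (0-based, half-open).

    Assumption: keeping the middle `extension` nt of a tag extended to
    2 * extension nt is the same as moving `shift` nt downstream of the
    5' end and taking `extension` nt from there.
    """
    if strand == "+":
        # 5' end is `start`; the 3' direction is increasing coordinates.
        lo = start + shift
        return lo, lo + extension
    # Minus strand: 5' end is `end`; the 3' direction is decreasing coordinates.
    hi = end - shift
    return hi - extension, hi


def pileup(tags, length, extension=75, shift=37):
    """Count, for each base of a region of `length` nt, the windows covering it."""
    diff = [0] * (length + 1)  # difference array: +1 at window start, -1 at end
    for start, end, strand in tags:
        lo, hi = middle_window(start, end, strand, extension, shift)
        lo, hi = max(lo, 0), min(hi, length)  # clip to the region
        if lo < hi:
            diff[lo] += 1
            diff[hi] -= 1
    profile, running = [], 0
    for delta in diff[:length]:
        running += delta
        profile.append(running)
    return profile


# Two hypothetical 25 nt tags on opposite strands that flank one nucleosome.
example_tags = [(100, 125, "+"), (225, 250, "-")]
example_profile = pileup(example_tags, 300)
```

In this sketch the two tags, which lie 125 nt apart on opposite strands, contribute windows that almost coincide, which is why the combined profile peaks over the presumed nucleosome centre.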

 

Paper

Our paper has recently been accepted by BMC Genomics. For details, please refer to

 

Zhang Y, Shin H, Song JS, Lei Y, Liu XS. Identifying Positioned Nucleosomes with Epigenetic Marks in Human from ChIP-Seq. BMC Genomics 2008, 9:537. (see http://www.biomedcentral.com/1471-2164/9/537)

 

Our experimental results can be obtained at http://liulab.dfci.harvard.edu/NPS/Result/.

 

 

Installation

The latest version of NPS is 1.0.3.2, and it is available as open source software (download). NPS was developed with Python 2.5. Three Python packages (NumPy, RPy, and PyWavelets) must be installed, along with R, for NPS to run properly. RPy2 and R 2.7.1 are not compatible with the current version of NPS.

 

NPS-1.0.3.2 provides 'hg19' and 'mm9' in addition to 'hg18' and 'mm8'.

 

Usage

After downloading and unpacking the NPS package, go to the folder and type

 

python SeqTag.py SeqTag.par

 

in the command window to run NPS. SeqTag.par is a text file that contains the self-explanatory parameters that must be set for NPS. The following is an example parameter file, the one actually used in our study.

 

Note that the minimum peak width (MIN_WIDTH) should be larger than the tag extension (EXTENSION) so that p-values are estimated accurately. As of NPS-1.0.2, a warning is raised when a peak narrower than the extension is encountered.



#############################################################################
# Parameters for NPS.
#

#
# Input and output files
#
INFILE = HM_hg18.bed            # input tag file
OUTFILE = HM_hg18_peak.bed      # output file of identified nucleosome positions

#
# Preprocessing
#
SPENAME = hg18                  # species name
EXTENSION = 75                  # each tag is extended to 150 nt and
SHIFT = 37                      # the middle 75 nt is taken
WANT_SORT = yes                 # yes: sorting the tags, no: not sorting the tags

#
# Wavelet denoising
#
WANT_DENOISE = yes              # yes: doing the wavelet denoising, no: skipping it
DECOMP_LEVEL = 2                # level 2 wavelet decomposition
WAVELET = coif4                 # use coiflet4 for wavelet denoising
THRESHOLD_EST = heursure        # denoising threshold selection using SURE
THRESHOLD_TYPE = soft           # soft thresholding
SCALE = mln                     # multi-level denoising

#
# Peak finding
#
INTERVAL = 10                   # 10 nt sampling for efficient signal processing
PVALUE = 1e-5                   # p-value cut-off for identifying nucleosomes
TAG_NUM = 186e6                 # the total tag numbers
LOG = 3                         # level of Gaussian smoothing in peak detection
SLOPE = 2
MIN_WIDTH = 80                  # minimum peak width
MAX_WIDTH = 250                 # maximum peak width
MIN_HEIGHT =                    # minimum peak height
MAX_HEIGHT = 10000              # maximum peak height
PEAK_INFLECTION_RATIO = 1.2     # minimum peak to inflection point ratio
MED_WSIZE = 5                   # median filter window size
BIAS_RATIO = 4                  # allowable ratio between + tags and - tags
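
The recommendation that MIN_WIDTH be larger than EXTENSION can be checked before a run. The sketch below is not part of NPS: parse_par and check_widths are hypothetical helpers, and the "KEY = value" format with "#" comments is assumed from the example parameter file above.

```python
def parse_par(text):
    """Parse 'KEY = value  # comment' lines into a dict of strings."""
    params = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or "=" not in line:
            continue  # skip blank lines and comment-only lines
        key, value = line.split("=", 1)
        params[key.strip()] = value.strip()  # value may be empty (e.g. MIN_HEIGHT)
    return params


def check_widths(params):
    """Return a warning string if MIN_WIDTH is not larger than EXTENSION, else None."""
    min_width = int(params["MIN_WIDTH"])
    extension = int(params["EXTENSION"])
    if min_width <= extension:
        return "MIN_WIDTH (%d) should be larger than EXTENSION (%d)" % (
            min_width, extension)
    return None


# A fragment of the example parameter file, inlined for illustration.
example_par = """
#
# Preprocessing
#
EXTENSION = 75      # tag extension
MIN_WIDTH = 80      # minimum peak width
MIN_HEIGHT =        # minimum peak height
"""
example_params = parse_par(example_par)
warning = check_widths(example_params)  # None for the example values
```

To check a real file, the text could be read with open("SeqTag.par").read() and passed to parse_par in the same way.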

 

 

 

Identified Peaks (Nucleosomes)

The output file (OUTFILE in the *.par file) is tab-delimited and lists the identified peaks (nucleosomes), as shown below. Its first line is a header that names the 5 columns. Note that the p-values of the identified peaks (nucleosomes) are reported on a log scale, as -10*log10(pvalue).

 

chr     start     end       name          -10*log10(pvalue)
chr21   9719971   9720101   nucleosome1   1.22E+02
chr21   9720591   9720671   nucleosome2   5.36E+01
chr21   9721111   9721191   nucleosome3   6.99E+01
chr21   9721341   9721421   nucleosome4   1.87E+02
chr21   9722171   9722251   nucleosome5   6.43E+01
chr21   9722311   9722391   nucleosome6   6.99E+01
chr21   9723031   9723111   nucleosome7   2.44E+02
chr21   9723531   9723631   nucleosome8   2.31E+02
chr21   9723971   9724051   nucleosome9   2.70E+02
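
The last column can be converted back to a plain p-value by inverting -10*log10(pvalue). The following sketch assumes the tab-delimited, 5-column layout with one header line described above; read_peaks and score_to_pvalue are hypothetical helper names, not functions provided by NPS.

```python
def score_to_pvalue(score):
    """Invert score = -10 * log10(pvalue)."""
    return 10.0 ** (-score / 10.0)


def read_peaks(lines):
    """Parse output lines into (chr, start, end, name, score) tuples."""
    peaks = []
    rows = iter(lines)
    next(rows, None)  # skip the header line
    for row in rows:
        fields = row.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # ignore blank or malformed lines
        chrom, start, end, name, score = fields[:5]
        peaks.append((chrom, int(start), int(end), name, float(score)))
    return peaks


# Two rows from the example output, inlined for illustration.
example_lines = [
    "chr\tstart\tend\tname\t-10*log10(pvalue)\n",
    "chr21\t9719971\t9720101\tnucleosome1\t1.22E+02\n",
    "chr21\t9720591\t9720671\tnucleosome2\t5.36E+01\n",
]
example_peaks = read_peaks(example_lines)
```

On this scale, a cut-off of PVALUE = 1e-5 corresponds to a score of 50, so every peak in the example output above passes it.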