Welcome to SeqSpider!

1. Introduction
2. Requirements
3. Installation
4. Usage
5. Updates
6. Download
7. FAQs

1. Introduction

Deep sequencing has spurred genome-wide mapping of transcription factor binding sites, histone modifications and DNA methylation. However, due to experimental variations, there are still no computational tools that can universally accept and seamlessly integrate different types (discrete/real/profile) of deep sequencing data from different sources to predict regulatory networks. Here, we have developed a Bayesian network inference algorithm, 'SeqSpider', to address this bottleneck. SeqSpider is the first Bayesian network algorithm that enables learning from tag distributions, a unique feature of ChIP-Seq and bisulfite sequencing data. Combined with a profile clustering method for noise removal, it enables ab initio identification of interactions from multiple sources of heterogeneous data. SeqSpider correctly predicted the interactions between DNA methylation, histone modifications, gene expression, transcription factors and chromatin modification complexes, as well as their underlying motif interactions, using datasets of two human embryonic stem cell lines from three laboratories. Furthermore, the inferred network model predicts an intriguing enhancer-promoter interaction mechanism, where H3K4me3 serves as a signal relaying hub for information propagation among different epigenetic modification and regulatory domains. For details, please refer to the paper: Liu, Y., N. Qiao, et al. (2013). "A novel Bayesian network inference algorithm for integrative analysis of heterogeneous deep sequencing data." Cell Res.

Anyone can use the source code, documentation or executable file of SeqSpider free of charge for non-commercial use. For commercial use, please contact the author.

2. Requirements

Linux x64   http://www.ubuntu.com/
Perl   http://www.perl.org/

3. Installation

Installation of SeqSpider only requires unpacking the "seqspider.zip" file on any Linux platform and adding the resulting directory to the PATH.

4. Usage

(1) SeqSpider.pl
The program "SeqSpider.pl" builds a Bayesian network of TFs/histone modifications/DNA methylation from raw deep sequencing reads (usually provided as BED format files) and a REFFLAT file (giving the positions of the gene TSSs, which can be downloaded from the UCSC genome browser). The following example uses the REFFLAT file "data/hg18/refFlat.txt" and several BED format files under the "data" directory as input. SuperKmeans is used to group the TSS profiles into 1000 clusters, and the output file is "test_matrix.txt".
Usage: perl SeqSpider.pl --help
--refSeq   list of target elements in refseq format; the hg18 refFlat.txt is included in the package (default). If your BED files are based on another genome or another human assembly, download the corresponding file from UCSC.
--methyFiles   REQUIRED, use comma "," to delimit multiple inputs, wildcard "*" supported, e.g.: -methy "data1/*.bed,data2/*.bed"
--shift   shift size towards 3' end of short reads (default:0)
--debug    to debug the pipeline
--output   FILENAME ( "TSS_matrix.txt")
--reg_factor   reg_factor
--percent   percent of each group (default 0.9)
--SKcluster   perform SuperKmeans clustering with n groups
--SKtryN   SK Trial_num, default 20
--Help   Show this message

Example usage:
perl SeqSpider.pl --refSeq data/hg18/refFlat.txt --methyFiles "data/*.bed" --output test_matrix.txt --SKcluster 1000
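
Before launching a long run, it can help to confirm that the BED patterns actually match files. The following Python sketch is not part of SeqSpider: the helper names `expand_bed_spec` and `build_seqspider_command` are hypothetical, and the expansion only mirrors the documented comma-plus-wildcard syntax of --methyFiles (how SeqSpider.pl expands the patterns internally is not shown here). The command uses only the options from the example above.

```python
import glob


def expand_bed_spec(spec):
    """Expand a comma-delimited list of wildcard patterns into file paths.

    Hypothetical pre-flight helper: it mirrors the documented --methyFiles
    syntax ("data1/*.bed,data2/*.bed") so missing inputs can be spotted early.
    """
    files = []
    for pattern in spec.split(","):
        pattern = pattern.strip()
        if pattern:
            files.extend(sorted(glob.glob(pattern)))
    return files


def build_seqspider_command(refseq, bed_spec, output, n_clusters):
    """Build the argument list for the documented example invocation."""
    return [
        "perl", "SeqSpider.pl",
        "--refSeq", refseq,
        "--methyFiles", bed_spec,   # passed through unexpanded, as in the example
        "--output", output,
        "--SKcluster", str(n_clusters),
    ]
```

If `expand_bed_spec("data/*.bed")` returns an empty list, the pattern matched nothing and the run would have no input. The list returned by `build_seqspider_command` can be handed to `subprocess.run` from the SeqSpider directory.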

(2) exeABCD.pl
The program "exeABCD.pl" builds a Bayesian network from a matrix file. The matrix file is tab-separated: each row represents a gene and each column a node (regulator). If a node is represented by a vector (for example, a regulator divided into 10 bins at the TSS region), the headers of its columns should all be the same, as shown in the example file "data/human_ESC_regulaters.tsv".
Usage: perl exeABCD.pl --help
--input    FILENAME ( "*_matrix.txt")
--reg_factor   reg_factor (default 3)
--is_wise    is_wise for nips (default 2)
--percent    percent of each group (default 0.9) or exact gene numbers if >2
--if_normalized    whether to perform column-wise normalization (default 1)
--pseudo_count    used in normalization (default 1; for ChIP-Seq data, 1 is better)
--SKcluster    perform SuperKmeans clustering with n groups
--SKmaxIter    max iteration for SKmeans, default 400
--SKtryN    SK Trial_num, default 20
--mix    set the data type of each node (default: estimated from the column names of the input file). For example, "10 10 8 1" means there are 29 columns in the input data: the first two nodes are 10-dimensional vector data, each with identical column names; the third node is 8-dimensional vector data; and the last node is continuous data. In addition, you can manually add more discretized data in the form "10 10 8 1 0 0" (31 columns in total).
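
To illustrate how a --mix string relates to the column names, here is a minimal Python sketch that counts runs of identical, adjacent header names. The function name `infer_mix` is hypothetical and this is not the estimation code inside exeABCD.pl. It assumes the list passed in holds node columns only (any gene identifier column is excluded) and that columns of one node are adjacent. Single columns are reported as "1" (continuous); discretized nodes ("0") cannot be recognized from names alone and would have to be set by hand.

```python
from itertools import groupby


def infer_mix(node_columns):
    """Return a --mix style string such as "10 10 8 1".

    Each run of identical, adjacent column names is treated as one node,
    and the run length is that node's dimension (an assumption made for
    illustration; see the lead-in).
    """
    return " ".join(str(len(list(run))) for _, run in groupby(node_columns))


# Hypothetical header: two 10-bin profiles, one 8-bin profile, one scalar.
header = ["H3K4me3"] * 10 + ["H3K27me3"] * 10 + ["DNAme"] * 8 + ["expression"]
mix = infer_mix(header)
```

With the hypothetical header above, `mix` describes 10 + 10 + 8 + 1 = 29 columns, matching the example in the --mix description.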

Example usage:
perl exeABCD.pl --input data/test_matrix.txt -SKcluster 1000

5. Updates

version 1.01
1. fix file "data/test_matrix.txt";
2. change file "exeABCD.pl" to a link to avoid bad library reference;
3. change the usage of "exeABCD.pl" to "perl exeABCD.pl --input data/test_matrix.txt -SKcluster 1000 ";

6. Download

SeqSpider can be downloaded from this link: Download

7. FAQs

We welcome your questions. Please feel free to contact us if anything is unclear.

1) How to reproduce the hESC BN described in the paper?
   Run the command "sh test.sh"

2) How many nodes are supported by SeqSpider?
   Currently, SeqSpider supports inferring BNs with fewer than 100 nodes.

3) How to cite SeqSpider?
   SeqSpider users, please cite the following paper: Liu, Y., N. Qiao, et al. (2013). "A novel Bayesian network inference algorithm for integrative analysis of heterogeneous deep sequencing data." Cell Res.

4) For the *.bed files located at (/data/*.bed), what are the values in the fourth and fifth columns used for?
   The fourth column in the testing data is not used. The fifth column holds the raw count in the given region and is used as the signal at TSS sites. Please check the script "scan_bed_for_TSS.pl" in the tools directory for more information. Normalization is handled in the later learning process.
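
As a rough illustration of how fifth-column counts could become a binned signal around a TSS, here is a Python sketch. It is not the logic of "scan_bed_for_TSS.pl": the function name `tss_profile`, the window size, the number of bins, and assigning each interval to a bin by its midpoint are all assumptions made for the example, and strand is ignored.

```python
def tss_profile(bed_lines, chrom, tss, half_window=1000, n_bins=10):
    """Sum BED column-5 counts into equal-width bins spanning tss +/- half_window.

    Illustrative assumptions: tab-separated BED lines with at least five
    columns, one bin per interval chosen by the interval midpoint.
    """
    bin_size = (2 * half_window) / n_bins
    window_start = tss - half_window
    profile = [0.0] * n_bins
    for line in bed_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5 or fields[0] != chrom:
            continue  # skip malformed lines and other chromosomes
        start, end, count = int(fields[1]), int(fields[2]), float(fields[4])
        midpoint = (start + end) // 2
        index = int((midpoint - window_start) // bin_size)
        if 0 <= index < n_bins:
            profile[index] += count  # the fourth column (name) is never read
    return profile


example = [
    "chr1\t100\t200\tregion_a\t3",
    "chr1\t950\t1050\tregion_b\t2",
    "chr2\t100\t200\tregion_c\t9",
]
profile = tss_profile(example, "chr1", tss=1000)
```

Here the first interval falls into the first bin, the second into the bin starting at the TSS, and the chr2 interval is ignored.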

5) What is the output of SeqSpider?
   SeqSpider outputs two files: a "SIF" file named "*_overlap.sif", which contains the edge information and can be opened in a text editor or Excel, or imported into Cytoscape for network visualization; and a ROC file named "*Roc_D.txt", which contains the ROC curve for evaluating network stability and can be opened in Excel.

6) What do "reg"/"noreg" mean in the "SIF" file?
   "reg" labels a directed edge; "noreg" labels an edge that has no direction.
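
For readers who want to post-process the edge file outside Cytoscape, the following Python sketch splits edges by label. The function name `parse_sif` is hypothetical. It assumes the usual SIF layout of "source relation target [target ...]" separated by whitespace, and that a "reg" edge points from the first node to the following node(s); that direction follows the general SIF convention and is not confirmed by this page.

```python
def parse_sif(lines):
    """Split SIF lines into directed ("reg") and undirected ("noreg") edges."""
    directed, undirected = [], []
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # blank lines or isolated nodes carry no edge
        source, relation, targets = fields[0], fields[1], fields[2:]
        for target in targets:
            if relation == "reg":
                directed.append((source, target))
            elif relation == "noreg":
                # store undirected pairs in sorted order so A-B equals B-A
                undirected.append(tuple(sorted((source, target))))
    return directed, undirected


# Hypothetical node names, for illustration only.
sample = ["A\treg\tB", "C\tnoreg\tB", "", "D reg E F"]
directed, undirected = parse_sif(sample)
```

To parse a real result, pass the open file handle of a "*_overlap.sif" file instead of the sample list.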

7) How can I get the dashed edges in Figure S20?
   The method for recovering feedback edges (dashed edges in Figure S20) is presented on page 9 of the Supplementary text. As the program is not user-friendly, we have not yet wrapped it in the Perl scripts.