Protease-specific Substrate Database and Supplementary Files

We compiled a caspase substrate database from multiple resources: the CASBAH database, the MEROPS database, the CASVM web server, the UniProt database, and a literature search. All annotated substrate cleavage sites were verified experimentally. The current dataset contains a total of 370 caspase substrate sequences and 562 cleavage sites. The complete list of substrate sequences used in this study is available at the links below. To evaluate prediction performance objectively, we used 5-fold cross-validation: the substrate sequences were randomly divided into five subsets containing roughly equal numbers of sequences. In each validation step, one subset was held out in turn as the testing dataset, while the remaining four were used as the training dataset.
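The splitting procedure described above can be sketched as follows. This is a minimal illustration, not the code used in the study: the function name `k_fold_splits`, the fixed random seed, and the round-robin assignment of shuffled sequences to folds are all assumptions, since the text only states that the split was random and into roughly equal subsets at the substrate-sequence level.

```python
import random


def k_fold_splits(substrate_ids, k=5, seed=0):
    """Yield (train, test) lists for k-fold cross-validation.

    Splitting is done per substrate sequence, so all cleavage sites of a
    given substrate end up in the same fold. The seed is an arbitrary
    choice made here for reproducibility.
    """
    ids = list(substrate_ids)
    random.Random(seed).shuffle(ids)

    # Deal the shuffled IDs into k folds; fold sizes differ by at most one.
    folds = [ids[i::k] for i in range(k)]

    for i in range(k):
        test = folds[i]
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield train, test


if __name__ == "__main__":
    # Placeholder IDs standing in for the 370 substrate sequences.
    placeholder_ids = [f"substrate_{n}" for n in range(370)]
    for step, (train, test) in enumerate(k_fold_splits(placeholder_ids), 1):
        print(step, len(train), len(test))
```

Splitting by sequence rather than by individual cleavage site keeps sites from the same substrate out of both the training and testing subsets in any one step.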

The curated caspase substrate dataset and other supplementary materials for this study can be downloaded at the following links:

* Caspase substrate database used for the 5-fold cross-validation and leave-one-out cross-validation (LOOCV) tests (txt file). This dataset contains 370 caspase substrate sequences and 562 cleavage sites. It was constructed from multiple resources, including the MEROPS database, the CASBAH database, and a literature search. The annotations are organized as follows:
For each entry (starting with ">"):
1. The first line gives the UniProt ID.
2. The second line gives the substrate cleavage site from position P4 to P4'; "|" marks the cleavage position.
3. The third line gives the substrate sequence in FASTA format.
4. The fourth line gives the secondary structure predicted by the PSIPRED program. "H" denotes alpha-helix, "E" denotes beta-strand, and "C" denotes coil or loop.
5. The fifth line gives the solvent accessibility predicted by the SCRATCH program. "e" denotes exposed and "b" denotes buried.
6. The last line gives the natively unstructured (disordered) regions predicted by the DISOPRED2 program. "*" denotes disordered and "." denotes structured (ordered).
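A reader for the entry layout listed above might look like the sketch below. It rests on assumptions that should be checked against the actual file: that every entry occupies exactly six non-empty lines, that each annotation sits on a single line, and that the cleavage-site line can be kept as a raw string (how several sites for one substrate are encoded is not described here). The names `SubstrateEntry` and `parse_entries` are hypothetical, and the sample record is made up purely to show the layout.

```python
from dataclasses import dataclass


@dataclass
class SubstrateEntry:
    uniprot_id: str
    cleavage_site: str          # raw P4-P4' line, "|" marks the cleavage position
    sequence: str
    secondary_structure: str    # H / E / C per residue (PSIPRED)
    solvent_accessibility: str  # e / b per residue (SCRATCH)
    disorder: str               # * / . per residue (DISOPRED2)


def _build(block):
    # Assumption: exactly six lines per entry, in the documented order.
    if len(block) != 6:
        raise ValueError(f"expected 6 lines per entry, got {len(block)}: {block[0]!r}")
    header, site, seq, ss, acc, dis = block
    return SubstrateEntry(header[1:].strip(), site, seq, ss, acc, dis)


def parse_entries(text):
    """Parse all entries from the text of the dataset file."""
    entries, block = [], []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        # A line starting with ">" opens a new entry.
        if line.startswith(">") and block:
            entries.append(_build(block))
            block = []
        block.append(line)
    if block:
        entries.append(_build(block))
    return entries


# Made-up record for illustration only; not taken from the dataset.
SAMPLE = """>X00000
DEVD|GSAA
MADEVDGSAAK
CCCHHHHEECC
eeeebbbeeee
***........
"""

if __name__ == "__main__":
    for entry in parse_entries(SAMPLE):
        print(entry.uniprot_id, entry.cleavage_site, len(entry.sequence))
```

Raising an error on entries that do not have six lines makes any deviation from the assumed layout visible immediately instead of silently misaligning the annotations.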

* Independent testing dataset containing novel caspase cleavage sites (txt file). This dataset was extracted from the experimental study by Dix et al. (Cell, 2008, 134:679-691) and contains newly identified caspase substrate cleavage sites that had not been reported previously.

* Distribution of structural determinants of cleavage sites (Excel file)

* Amino acid occurrences in P4-P4' positions (Excel file)

* Distribution of secondary structure motifs in P4-P4' positions (Excel file)

* The source code of the Procleave program can be freely downloaded from this link. Users who want to run Procleave locally on a large number of potential protein substrates are encouraged to download this version rather than use the online web server.