Protease-specific Substrate Database and Supplementary Files
We compiled a caspase substrate database from multiple resources: the CASBAH database, the MEROPS database, the CASVM webserver, the UniProt database, and a literature search. All annotated substrate cleavage sites were verified experimentally. The current dataset contains a total of 370 caspase substrate sequences and 562 cleavage sites. The complete list of substrate sequences used in this study is available at the links below. To evaluate prediction performance objectively, we employed 5-fold cross-validation: the substrate sequences in this dataset were randomly divided into five subsets of roughly equal size. In each validation step, one subset was held out in turn as the testing dataset, while the remaining four were used as the training dataset.
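The 5-fold partitioning described above can be sketched as follows. This is a minimal illustration, not the code used in the study: the function name `five_fold_splits`, the fixed random seed, and the placeholder identifiers are all assumptions made for the example. The split is done at the level of substrate sequences, so all cleavage sites belonging to one substrate fall into the same subset.

```python
import random

def five_fold_splits(substrate_ids, k=5, seed=0):
    """Yield (train_ids, test_ids) pairs for k-fold cross-validation.

    Hypothetical helper: splits by substrate sequence, not by cleavage site.
    """
    ids = list(substrate_ids)
    random.Random(seed).shuffle(ids)  # seed is an assumption, for repeatability
    # Deal the shuffled IDs round-robin so subset sizes differ by at most one.
    folds = [ids[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield train, test

# Placeholder identifiers standing in for the 370 substrate sequences.
example_ids = [f"SUB{i:03d}" for i in range(370)]
splits = list(five_fold_splits(example_ids))
for train, test in splits:
    print(len(train), len(test))
```

With 370 sequences, each of the five steps tests on 74 sequences and trains on the other 296.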
The curated caspase substrate dataset and other supplementary materials for this study can be downloaded at the following links:
* Caspase
substrate database used for the 5-fold cross-validation and leave-one-out
cross-validation (LOOCV) tests (txt file). This dataset contains 370
caspase substrate sequences and 562 cleavage sites. It was constructed
by referring to multiple resources, including the MEROPS database, CASBAH,
and a literature search. The detailed annotations are described as follows:
For each entry (starting with ">")
1. The first line denotes the UniProt ID;
2. The second line denotes the substrate cleavage site, spanning the P4 to P4'
positions; "|" indicates the cleavage site;
3. The third line denotes the substrate sequence in FASTA format;
4. The fourth line denotes the secondary structure predicted by the PSIPRED
program. "H" denotes alpha-helix, "E" denotes beta-strand,
while "C" denotes coils or loops.
5. The fifth line denotes the predicted solvent accessibility by the SCRATCH
program. "e" denotes exposed, while "b" denotes buried.
6. The last line denotes the natively unstructured or disordered
regions predicted by the DISOPRED2 program. "*" denotes disordered, while
"." denotes structured or ordered.
* Independent testing
dataset containing novel caspase cleavage sites (txt file). This dataset
was extracted from the experimental study by Dix et al. (Cell, 2008,
134:679-691) and contains newly identified caspase substrate cleavage
sites that had not been reported previously.
* Distribution of
structural determinants of cleavage sites (Excel file)
* Amino acid occurrences in P4-P4'
positions (Excel file)
* Distribution of secondary
structure motifs in P4-P4' positions (Excel file)
* The source code of the Procleave program can be freely downloaded from this
link. Users who are interested in running Procleave locally on a large
number of potential protein substrates are encouraged to download this
version, instead of using the online webserver.
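
For users who want to read the caspase substrate dataset programmatically, the six-line entry format described above could be parsed roughly as follows. This is a minimal sketch under stated assumptions: the names `SubstrateEntry`, `parse_entries`, and `p1_position` are hypothetical, the UniProt ID is assumed to follow the ">" on the first line, each field is assumed to occupy exactly one line, and the second line is assumed to hold a single cleavage site. The example record is made up for illustration and is not taken from the dataset; the actual file should be checked against these assumptions before use.

```python
from dataclasses import dataclass

@dataclass
class SubstrateEntry:
    uniprot_id: str
    cleavage_site: str          # P4-P1 and P1'-P4' residues separated by "|"
    sequence: str
    secondary_structure: str    # H / E / C per residue (PSIPRED)
    solvent_accessibility: str  # e / b per residue (SCRATCH)
    disorder: str               # * / . per residue (DISOPRED2)

def parse_entries(text):
    """Parse six-line entries; assumes one line per field (see lead-in)."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    entries = []
    for i in range(0, len(lines), 6):
        block = lines[i:i + 6]
        if not block[0].startswith(">"):
            raise ValueError(f"expected '>' header, got: {block[0]!r}")
        if len(block) < 6:
            raise ValueError(f"truncated entry: {block[0]!r}")
        entries.append(SubstrateEntry(block[0][1:].strip(), *block[1:]))
    return entries

def p1_position(entry):
    """Return the 1-based position of the P1 residue, or None if not found."""
    left, right = entry.cleavage_site.split("|")
    start = entry.sequence.find(left + right)
    # start is 0-based; adding len(left) gives the 1-based index of P1.
    return None if start < 0 else start + len(left)

# Made-up record for illustration only (not a real UniProt entry).
example = """>X00000
DEVD|GVDE
MAAADEVDGVDEKK
CCCCCCCCCCHHHH
eeeeeeeebbbbee
****..........
"""

entries = parse_entries(example)
print(entries[0].uniprot_id, p1_position(entries[0]))
```

Locating the site by searching for the P4-P4' window keeps the sketch short, but it returns only the first match, so a sequence containing the same eight residues more than once would need extra handling.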